Conference Paper

Automated Feature Reduction in Machine Learning

Authors: David Shilane

... In this case, a multi-segment feature selection method with a specific screening tendency is one solution. To address these problems, David Shilane proposed an automated feature dimension reduction framework [4], which describes the dimensionality reduction of high-dimensional feature data from five aspects. However, the framework has certain limitations: its screening method does not take into account the characteristics of the subsequent deep learning algorithms, so it cannot distinguish the features that are most effective for deep learning. ...
... This runs counter to the goal of a deep learning classification task, in which we want the differences between classes to be as large as possible, so such features are not helpful and need to be filtered out by a dedicated method. Based on this idea, this paper proposes the data discrimination degree filtering formula shown in Formula (3) and Formula (4). ...
Preprint
Full-text available
When training a neural network model, a large number of features in the data set leads to a more complex network and a high time cost. Feature selection is therefore performed on the original data set to choose a feature subset that is conducive to model training and improves the model's performance. Traditional feature selection algorithms use a coarse screening process and struggle to eliminate features with low discrimination. This paper therefore proposes a discrimination filtering formula and the DI-CFS feature selection algorithm. The discrimination filtering formula filters out invalid features and inefficient features with low discrimination. The DI-CFS algorithm combines the discrimination filtering formula, the isolation forest algorithm, and an improved CFS algorithm. On a set of wind turbine data, the DI-CFS algorithm and several traditional feature selection algorithms are each used to select features, and the resulting feature subsets are fed into the same neural network model for training and classification. The experimental results show that the discrimination filtering formula has a positive effect and that the DI-CFS algorithm yields a better feature selection result.
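As a rough, self-contained illustration of the pipeline this abstract describes (the paper's actual discrimination Formulas (3)-(4) and its CFS modifications are not reproduced here, so the discrimination measure, thresholds, and greedy search below are hypothetical placeholders), a sketch in Python might look like:

import numpy as np
from sklearn.ensemble import IsolationForest

def discrimination_filter(X, y, threshold=0.05):
    """Hypothetical stand-in for the paper's discrimination formula: keep
    features whose between-class mean gap, scaled by the feature's overall
    spread, exceeds a threshold."""
    keep = []
    for j in range(X.shape[1]):
        means = [X[y == c, j].mean() for c in np.unique(y)]
        spread = (max(means) - min(means)) / (X[:, j].std() + 1e-12)
        if spread > threshold:
            keep.append(j)
    return keep

def cfs_merit(X, y, subset):
    """Classic CFS merit: k * mean(feature-class corr) /
    sqrt(k + k*(k-1) * mean(feature-feature corr))."""
    k = len(subset)
    rcf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
    if k == 1:
        return rcf
    rff = np.mean([abs(np.corrcoef(X[:, a], X[:, b])[0, 1])
                   for i, a in enumerate(subset) for b in subset[i + 1:]])
    return k * rcf / np.sqrt(k + k * (k - 1) * rff)

def di_cfs_sketch(X, y):
    # 1) discrimination filtering (placeholder for Formulas (3)-(4))
    cols = discrimination_filter(X, y)
    # 2) drop outlier rows flagged by an isolation forest
    mask = IsolationForest(random_state=0).fit_predict(X[:, cols]) == 1
    Xf, yf = X[mask][:, cols], y[mask]
    # 3) greedy forward search that maximizes the CFS merit
    selected = []
    while True:
        best = max((c for c in range(len(cols)) if c not in selected),
                   key=lambda c: cfs_merit(Xf, yf, selected + [c]),
                   default=None)
        if best is None or (selected and
                cfs_merit(Xf, yf, selected + [best]) <= cfs_merit(Xf, yf, selected)):
            break
        selected.append(best)
    return [cols[c] for c in selected]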
... Deep learning models provide considerable gains over traditional machine learning techniques during both the learning and forecasting stages. These benefits include a reduced need for human guidance and the automatic extraction of less obvious features [38], [39]. Supervised learning, in which models are trained on labeled data, and unsupervised learning, in which models are trained on unlabeled data, are the two primary categories of machine learning. ...
Article
Full-text available
The rapid evolution of software, hardware, and internet technology has enabled the proliferation of internet-connected sensor tools that gather information and observations from the physical world. The IoT comprises billions of intelligent devices, extending physical and virtual boundaries. However, traditional data processing methods face significant challenges in handling the vast volume and variety of IoT data. These devices generate vast amounts of data daily, with diverse applications crucial for generating new knowledge, identifying future trends, and making informed decisions, which underscores the value of IoT and the technologies built on it. Deep learning (DL) has significantly enhanced IoT and mobile applications, demonstrating promising outcomes. Its data-driven, anomaly-based approach to detecting emerging threats positions it well for IoT intrusion detection. This paper proposes a comprehensive framework leveraging DL techniques to address data processing challenges in IoT environments and enhance intelligence and application capabilities. Furthermore, this study systematically reviews and categorizes existing deep learning techniques applied in IoT, identifies critical challenges in IoT data processing, and provides actionable insights to inspire further research in this domain. It introduces IoT and its data processing challenges and explores various DL approaches applied to IoT data. Significant DL efforts in IoT are surveyed and summarized, focusing on datasets, features, applications, and challenges to inspire further advancements in this field.
... These models are widely used in classification tasks, and each has its own unique characteristics and advantages. In comparison experiments, all models are trained using optimal parameters to ensure fairness and comparability of results [26][27][28][29]. ...
Article
Full-text available
Diabetes is currently one of the most dangerous diseases in modern society. Prevention is an extremely important aspect of medicine, and as artificial intelligence and the healthcare industry increasingly converge, using machine learning models for the prediction and diagnosis of diabetes is a major trend. To validate the advantages and potential of the XGBoost model for diabetes prediction, this study identified 10 key features by processing a medical examination dataset containing 556,495 samples. Among them, glycated hemoglobin has high clinical value as a predictor. By constructing six machine learning models (XGBoost, Decision Tree, Logistic Regression, Random Forest, CatBoost, and LightGBM) and comparing their performance, we found that XGBoost performs best overall, with an accuracy of 97.5%, recall of 97%, F1 score of 96.9%, and ROC-AUC of 0.971.
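A generic sketch of this kind of multi-model comparison (synthetic data stands in for the medical examination dataset, and the default hyperparameters below are placeholders rather than the tuned settings used in the study):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the medical examination data (10 selected features).
X, y = make_classification(n_samples=5000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(random_state=0),
    # XGBoost, CatBoost, and LightGBM would be added analogously
    # (xgboost.XGBClassifier, catboost.CatBoostClassifier, lightgbm.LGBMClassifier).
}

for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    proba = model.predict_proba(X_te)[:, 1]
    print(f"{name}: acc={accuracy_score(y_te, pred):.3f} "
          f"recall={recall_score(y_te, pred):.3f} "
          f"f1={f1_score(y_te, pred):.3f} "
          f"auc={roc_auc_score(y_te, proba):.3f}")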
... On the other hand, feature selection/reduction [42,44] focuses on identifying the most relevant data points for training. This can be done through algorithms that assign weights to features based on their importance. ...
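For illustration, one common way to realize importance-based weighting and selection (a generic scikit-learn sketch, not the specific algorithms cited in [42,44]) is to fit a tree ensemble and keep only the features whose importance exceeds a threshold:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = make_classification(n_samples=1000, n_features=50, n_informative=8,
                           random_state=0)

# Fit a forest, then keep features whose importance is above the mean importance.
selector = SelectFromModel(RandomForestClassifier(n_estimators=200, random_state=0),
                           threshold="mean")
X_reduced = selector.fit_transform(X, y)
print(X.shape, "->", X_reduced.shape)  # e.g. (1000, 50) -> (1000, ~10)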
Article
Full-text available
Cervical cancer is one of the leading causes of death in women worldwide. Prompt and accurate diagnosis is imperative for the treatment of cervical cancer through the utilization of pap smear slides, although this is a multifaceted and time-intensive process. An automatic diagnosis model based on deep learning, particularly a convolutional neural network (CNN), can improve the accuracy and speed of cervical cancer identification. This paper proposes a cross entropy-based multi-deep transfer learning model for the early detection and categorization of cervical cancer cells. The proposed model consists of four phases: the pre-processing phase, the feature extraction and fusion phase, the feature reduction phase, and the feature classification phase. In the pre-processing phase, cervical cancer input images are resized to 64 x 64 to match the input layer of the deep neural network. The feature extraction and fusion phase extracts features through different deep transfer learning models, including MobileNet, DenseNet, EfficientNet, Xception, RegNet, and ResNet-50, followed by a fusion of all extracted features. In the feature reduction phase, Principal Component Analysis (PCA) is applied as a feature reduction technique. Finally, a pipeline of three dense layers completes the classification process. A novel loss function termed smoothing cross-entropy is presented to enhance classification performance. The performance of the proposed model is validated using a benchmark dataset, namely the SIPaKMeD dataset. According to the results, the suggested model attains a remarkable accuracy of 97% on the SIPaKMeD dataset using 676 features.
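A minimal sketch of the reduce-then-classify portion of such a pipeline, assuming the fused deep features are already available as a matrix; the random data, class count, and dense-layer sizes below are placeholders, while the 676 components simply mirror the feature count quoted above:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier

# Pretend `fused` holds features already extracted and concatenated from
# several pretrained backbones; here it is random data of a plausible shape.
rng = np.random.default_rng(0)
fused = rng.normal(size=(800, 4096))
labels = rng.integers(0, 5, size=800)   # e.g. five cell classes

# Reduce the fused feature vectors with PCA, then classify with dense layers.
reduced = PCA(n_components=676, random_state=0).fit_transform(fused)
clf = MLPClassifier(hidden_layer_sizes=(256, 128, 64), max_iter=300,
                    random_state=0).fit(reduced, labels)
print(clf.score(reduced, labels))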
... This method increases the data dimensionality. Data can be processed further using feature reduction methods such as an autoencoder (AE), which compresses the data and decreases their dimensionality [44]. ...
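A minimal autoencoder sketch for this style of feature reduction (PyTorch is used purely for illustration; the input width, layer sizes, and bottleneck dimension are arbitrary assumptions):

import torch
from torch import nn

class AutoEncoder(nn.Module):
    """Compress input vectors to a small bottleneck, then reconstruct them."""
    def __init__(self, n_features: int, n_latent: int = 8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(),
                                     nn.Linear(64, n_latent))
        self.decoder = nn.Sequential(nn.Linear(n_latent, 64), nn.ReLU(),
                                     nn.Linear(64, n_features))

    def forward(self, x):
        return self.decoder(self.encoder(x))

x = torch.randn(256, 40)                     # stand-in for sensor feature vectors
model = AutoEncoder(n_features=40)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(200):                         # reconstruction training loop
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), x)
    loss.backward()
    opt.step()
with torch.no_grad():
    codes = model.encoder(x)                 # reduced 8-dimensional features
print(codes.shape)                           # torch.Size([256, 8])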
Article
Full-text available
Human gait activity recognition is an emerging field of motion analysis that can be applied in various application domains. One of the most attractive applications includes the monitoring of gait disorder patients, tracking their disease progression and the modification/evaluation of drugs. This paper proposes a robust, wearable gait motion data acquisition system that allows either the classification of recorded gait data into desirable activities or the identification of common risk factors, thus enhancing the subject's quality of life. Gait motion information was acquired using accelerometers and gyroscopes mounted on the lower limbs, where the sensors were exposed to inertial forces during gait. Additionally, leg muscle activity was measured using strain gauge sensors. Our aim was to identify different gait activities within each gait recording using machine learning algorithms. To this end, various machine learning methods were tested and compared to establish the best-performing algorithm for classifying the recorded gait information. The combination of attention-based convolutional and recurrent neural network algorithms outperformed the other tested algorithms and was further tested individually on the datasets of five subjects, delivering the following averaged classification results: 98.9% accuracy, 96.8% precision, 97.8% sensitivity, 99.1% specificity and 97.3% F1-score. Moreover, the algorithm's robustness was also verified with the successful detection of freezing gait episodes in a Parkinson's disease patient. The results of this study indicate a feasible gait event classification method capable of complete algorithm personalization.
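A compact sketch of an attention-based convolutional-plus-recurrent classifier of the kind described above (the channel count, window length, class count, and layer sizes are illustrative assumptions, not the authors' architecture):

import torch
from torch import nn

class ConvGRUAttention(nn.Module):
    """1D conv front end, GRU over time, attention pooling, linear classifier."""
    def __init__(self, n_channels: int, n_classes: int, hidden: int = 64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_channels, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(32, 32, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.gru = nn.GRU(32, hidden, batch_first=True)
        self.attn = nn.Linear(hidden, 1)      # scores each time step
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):                     # x: (batch, time, channels)
        h = self.conv(x.transpose(1, 2)).transpose(1, 2)     # (batch, time, 32)
        h, _ = self.gru(h)                                    # (batch, time, hidden)
        w = torch.softmax(self.attn(h), dim=1)                # attention weights
        pooled = (w * h).sum(dim=1)                           # weighted sum over time
        return self.head(pooled)

# Toy windows: 8 recordings, 128 time steps, 9 sensor channels (e.g. 3 IMUs x 3 axes).
x = torch.randn(8, 128, 9)
logits = ConvGRUAttention(n_channels=9, n_classes=5)(x)
print(logits.shape)  # torch.Size([8, 5])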
Book
The Art of Feature Engineering, by Pablo Duboue. Cambridge University Press (Cambridge Core: Pattern Recognition and Machine Learning).
Article
We propose a new method for estimation in linear models. The ‘lasso’ minimizes the residual sum of squares subject to the sum of the absolute values of the coefficients being less than a constant. Because of the nature of this constraint it tends to produce some coefficients that are exactly 0 and hence gives interpretable models. Our simulation studies suggest that the lasso enjoys some of the favourable properties of both subset selection and ridge regression. It produces interpretable models like subset selection and exhibits the stability of ridge regression. There is also an interesting relationship with recent work in adaptive function estimation by Donoho and Johnstone. The lasso idea is quite general and can be applied in a variety of statistical models: extensions to generalized regression models and tree-based models are briefly described.
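In symbols, the lasso estimate described above solves the constrained least-squares problem (standard formulation, stated here for reference):

\[
\hat\beta^{\text{lasso}} = \arg\min_{\beta}\; \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^{2}
\quad \text{subject to} \quad \sum_{j=1}^{p}\lvert\beta_j\rvert \le t,
\]

where the bound t controls the amount of shrinkage; for small enough t some coefficients are forced exactly to zero.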
Chapter
Chapter contents: Introduction; Study Design; Ordinal Outcome Scale; Variable Clustering; The Proportional Odds Model and Developing Cluster Summary Scores; Assessing Ordinality of Y for Each X, and Unadjusted Checking of PO and CR Assumptions; A Tentative Full Proportional Odds Model; Residuals for Checking the Proportional Odds Assumption; Continuation Ratio Ordinal Logistic Model; Extended Continuation Ratio Model; Penalized Estimation; Using Approximations to Simplify the Model; Validating the Model; Summary; Appendix; Acknowledgements; References.
Article
Machine learning systems automatically learn programs from data. This is often a very attractive alternative to manually constructing them, and in the last decade the use of machine learning has spread rapidly throughout computer science and beyond. Machine learning is used in Web search, spam filters, recommender systems, ad placement, credit scoring, fraud detection, stock trading, drug design, and many other applications. A recent report from the McKinsey Global Institute asserts that machine learning (a.k.a. data mining or predictive analytics) will be the driver of the next big wave of innovation [15]. Several fine textbooks are available to interested practitioners and researchers (for example, Mitchell [16] and Witten et al. [24]). However, much of the "folk knowledge" that is needed to successfully develop machine learning applications is not readily available in them. As a result, many machine learning projects take much longer than necessary or wind up producing less-than-ideal results. Yet much of this folk knowledge is fairly easy to communicate. This is the purpose of this article.
Article
Modern advances in computing power have greatly widened scientists' scope in gathering and investigating information from many variables, information which might have been ignored in the past. Yet to effectively scan a large pool of variables is not an easy task, although our ability to interact with data has been much enhanced by recent innovations in dynamic graphics. In this article, we propose a novel data-analytic tool, sliced inverse regression (SIR), for reducing the dimension of the input variable x without going through any parametric or nonparametric model-fitting process. This method explores the simplicity of the inverse view of regression; that is, instead of regressing the univariate output variable y against the multivariate x, we regress x against y. Forward regression and inverse regression are connected by a theorem that motivates this method. The theoretical properties of SIR are investigated under a model of the form, y = f(β1x, …, βKx, ε), where the βk's are the unknown row vectors. This model looks like a nonlinear regression, except for the crucial difference that the functional form of f is completely unknown. For effectively reducing the dimension, we need only to estimate the space [effective dimension reduction (e.d.r.) space] generated by the βk's. This makes our goal different from the usual one in regression analysis, the estimation of all the regression coefficients. In fact, the βk's themselves are not identifiable without a specific structural form on f. Our main theorem shows that under a suitable condition, if the distribution of x has been standardized to have the zero mean and the identity covariance, the inverse regression curve, E(x | y), will fall into the e.d.r. space. Hence a principal component analysis on the covariance matrix for the estimated inverse regression curve can be conducted to locate its main orientation, yielding our estimates for e.d.r. directions. Furthermore, we use a simple step function to estimate the inverse regression curve. No complicated smoothing is needed. SIR can be easily implemented on personal computers. By simulation, we demonstrate how SIR can effectively reduce the dimension of the input variable from, say, 10 to K = 2 for a data set with 400 observations. The spin-plot of y against the two projected variables obtained by SIR is found to mimic the spin-plot of y against the true directions very well. A chi-squared statistic is proposed to address the issue of whether or not a direction found by SIR is spurious.
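A compact sketch of the SIR recipe outlined above (standardize x, slice on y, average x within slices, then eigendecompose the weighted covariance of the slice means); the slice count and the toy data are illustrative assumptions:

import numpy as np

def sir_directions(X, y, n_slices=10, n_dirs=2):
    """Sliced inverse regression: estimate e.d.r. directions."""
    n, p = X.shape
    # 1) Standardize x to zero mean and identity covariance.
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    inv_sqrt = np.linalg.inv(np.linalg.cholesky(cov)).T
    Z = Xc @ inv_sqrt
    # 2) Slice the observations by the sorted values of y.
    order = np.argsort(y)
    slices = np.array_split(order, n_slices)
    # 3) Weighted covariance of the slice means of z (the inverse regression curve).
    M = sum((len(s) / n) * np.outer(Z[s].mean(axis=0), Z[s].mean(axis=0))
            for s in slices)
    # 4) Top eigenvectors of M, mapped back to the original scale.
    vals, vecs = np.linalg.eigh(M)
    top = vecs[:, np.argsort(vals)[::-1][:n_dirs]]
    return inv_sqrt @ top          # columns approximate the e.d.r. directions

# Toy example: y depends on x only through two linear combinations.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 10))
y = (X[:, 0] + X[:, 1]) ** 2 + 0.5 * X[:, 2] + 0.1 * rng.normal(size=400)
print(sir_directions(X, y).shape)   # (10, 2)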
Article
Categorical data fields characterized by a large number of distinct values represent a serious challenge for many classification and regression algorithms that require numerical inputs. On the other hand, these types of data fields are quite common in real-world data mining applications and often contain potentially relevant information that is difficult to represent for modeling purposes. This paper presents a simple preprocessing scheme for high-cardinality categorical data that allows this class of attributes to be used in predictive models such as neural networks, linear and logistic regression. The proposed method is based on a well-established statistical method (empirical Bayes) that is straightforward to implement as an in-database procedure. Furthermore, for categorical attributes with an inherent hierarchical structure, like ZIP codes, the preprocessing scheme can directly leverage the hierarchy by blending statistics at the various levels of aggregation. While the statistical methods discussed in this paper were first introduced in the mid-1950s, the use of these methods as a preprocessing step for complex models, like neural networks, has not been previously discussed in any literature.
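A small sketch of this style of preprocessing: smoothed, empirical-Bayes-like target encoding with blending up a ZIP-to-state hierarchy. The smoothing weight m, the column names, and the toy data are illustrative assumptions, not the paper's exact estimator:

import pandas as pd

df = pd.DataFrame({
    "zip":   ["94110", "94110", "10001", "10001", "10001", "60601"],
    "state": ["CA", "CA", "NY", "NY", "NY", "IL"],
    "y":     [1, 0, 1, 1, 0, 0],
})

m = 20.0                      # smoothing weight (illustrative choice)
global_mean = df["y"].mean()

# State level: shrink each state's mean toward the global mean.
state = df.groupby("state")["y"].agg(["mean", "count"])
state_enc = (state["count"] * state["mean"] + m * global_mean) / (state["count"] + m)

# ZIP level: shrink each ZIP's mean toward its state's (already smoothed) estimate.
zips = df.groupby(["zip", "state"])["y"].agg(["mean", "count"]).reset_index()
zips["prior"] = zips["state"].map(state_enc)
zips["zip_enc"] = (zips["count"] * zips["mean"] + m * zips["prior"]) / (zips["count"] + m)

df["zip_enc"] = df["zip"].map(zips.set_index("zip")["zip_enc"])
print(df)

In practice the encoding statistics would be computed on training folds only, so that the target is not leaked into the encoded feature.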
Article
In this article, we describe how the theory of sufficient dimension reduction, and a well-known inference method for it (sliced inverse regression), can be extended to regression analyses involving both quantitative and categorical predictor variables. As statistics faces an increasing need for effective analysis strategies for high-dimensional data, the results we present significantly widen the applicative scope of sufficient dimension reduction and open the way for a new class of theoretical and methodological developments.
Article
This paper describes the methodologies used to develop a prediction model to assist health workers in developing countries in facing one of the most difficult health problems in all parts of the world: the presentation of an acutely ill young infant. Statistical approaches for developing the clinical prediction model faced at least two major difficulties. First, the number of predictor variables, especially clinical signs and symptoms, is very large, necessitating the use of data reduction techniques that are blinded to the outcome. Second, there is no uniquely accepted continuous outcome measure or final binary diagnostic criterion. For example, the diagnosis of neonatal sepsis is ill-defined. Clinical decision makers must identify infants likely to have positive cultures as well as to grade the severity of illness. In the WHO/ARI Young Infant Multicentre Study we have found an ordinal outcome scale made up of a mixture of laboratory and diagnostic markers to have several clinical advantages as well as to increase the power of tests for risk factors. Such a mixed ordinal scale does present statistical challenges because it may violate constant slope assumptions of ordinal regression models. In this paper we develop and validate an ordinal predictive model after choosing a data reduction technique. We show how ordinality of the outcome is checked against each predictor. We describe new but simple techniques for graphically examining residuals from ordinal logistic models to detect problems with variable transformations as well as to detect non-proportional odds and other lack of fit. We examine an alternative type of ordinal logistic model, the continuation ratio model, to determine if it provides a better fit. We find that it does not but that this model is easily modified to allow the regression coefficients to vary with cut-offs of the response variable. Complex terms in this extended model are penalized to allow only as much complexity as the data will support. We approximate the extended continuation ratio model with a model with fewer terms to allow us to draw a nomogram for obtaining various predictions. The model is validated for calibration and discrimination using the bootstrap. We apply much of the modelling strategy described in Harrell, Lee and Mark (Statist. Med. 15, 361-387 (1998)) for survival analysis, adapting it to ordinal logistic regression and further emphasizing penalized maximum likelihood estimation and data reduction.
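For reference, the two ordinal models compared above have the standard forms (generic notation, not the paper's):

\[
\text{(PO)}\quad \operatorname{logit} P(Y \ge j \mid X) = \alpha_j + X\beta, \qquad
\text{(CR)}\quad \operatorname{logit} P(Y = j \mid Y \ge j, X) = \theta_j + X\gamma,
\]

and the extended continuation ratio model lets the slopes vary with the cut-off, \( \operatorname{logit} P(Y = j \mid Y \ge j, X) = \theta_j + X\gamma_j \), with the additional terms penalized as described in the abstract.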
Principal Component Analysis
  • I T Jolliffe
formulaic: Dynamic Generation and Quality Checks of Formula Objects
  • D Shilane
  • C Lee
  • Z Huang
  • Nelson