Article

Feature extraction: foundations and applications

Authors:
  • Isabelle Guyon (Université Paris-Saclay and INRIA)

... Experts predict advanced plant system understanding through tools such as genomics, proteomics, transcriptomics, and metabolomics, with data integration across these levels offering a holistic view of the molecular interactions driving plant responses to abiotic stress [34]. [12,20,35–43]. ...
... Supervised ML uses diverse features such as amino acid sequences and physicochemical properties for training data representation [41]. Feature selection is crucial, with three classes of methods: filter, wrapper, and embedded [42,43,127]. Algorithm selection is fundamental in ML, categorized into supervised (establishing input-output relationships from training data), unsupervised (identifying data patterns without known outcomes, e.g., clustering and dimension reduction) [128,129], and semi-supervised (handling labeled and unlabeled data) [130]. ...
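As a concrete illustration of the three method families named in the excerpt above, the following minimal sketch (scikit-learn assumed; synthetic data, not from any of the cited studies) contrasts a filter, a wrapper, and an embedded selector:

```python
# Minimal sketch of the three feature-selection families: filter, wrapper, embedded.
# Data and hyperparameters are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif, RFE, SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=30, n_informative=5, random_state=0)

# Filter: rank features by a model-agnostic statistic (here, mutual information).
filter_sel = SelectKBest(mutual_info_classif, k=5).fit(X, y)

# Wrapper: repeatedly refit a model and drop the weakest features (recursive feature elimination).
wrapper_sel = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X, y)

# Embedded: selection happens inside training, via an L1 penalty.
embedded_sel = SelectFromModel(
    LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
).fit(X, y)

for name, sel in [("filter", filter_sel), ("wrapper", wrapper_sel), ("embedded", embedded_sel)]:
    print(name, sel.get_support(indices=True))
```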
... Bioinformatics databases for plant stress and crop improvement. Modified from [12,20,35–43]. ...
Article
Full-text available
Abiotic stresses, including drought, salinity, extreme temperatures and nutrient deficiencies, pose significant challenges to crop production and global food security. To combat these challenges, the integration of bioinformatics educational tools and AI applications provide a synergistic approach to identify and analyze stress-responsive genes, regulatory networks and molecular markers associated with stress tolerance. Bioinformatics educational tools offer a robust framework for data collection, storage and initial analysis, while AI applications enhance pattern recognition, predictive modeling and real-time data processing capabilities. This review uniquely integrates bioinformatics educational tools and AI applications, highlighting their combined role in managing abiotic stress in plants and crops. The novelty is demonstrated by the integration of multiomics data with AI algorithms, providing deeper insights into stress response pathways, biomarker discovery and pattern recognition. Key AI applications include predictive modeling of stress resistance genes, gene regulatory network inference, omics data integration and real-time plant monitoring through the fusion of remote sensing and AI-assisted phenomics. Challenges such as handling big omics data, model interpretability, overfitting and experimental validation remain there, but future prospects involve developing user-friendly bioinformatics educational platforms, establishing common data standards, interdisciplinary collaboration and harnessing AI for real-time stress mitigation strategies in plants and crops. Educational initiatives, interdisciplinary collaborations and trainings are essential to equip the next generation of researchers with the required skills to utilize these advanced tools effectively. The convergence of bioinformatics and AI holds vast prospects for accelerating the development of stress-resilient plants and crops, optimizing agricultural practices and ensuring global food security under increasing environmental pressures. Moreover, this integrated approach is crucial for advancing sustainable agriculture and ensuring global food security amidst growing environmental challenges.
... Methods for selecting informative indicators can be divided into three groups: filter methods, wrapper methods, and embedded methods [3,4,5]. ...
... Among filter methods in regression analysis, a method based on analysing the correlations between the target variable and the independent variables has found application [3,4,5]. The Pearson correlation coefficient between the target and each independent variable is estimated, and the independent variables are then ranked by the absolute value of the correlation coefficient. ...
... However, the search process, which involves evaluating a large number of feature subsets, is quite time-consuming, so applying methods of this group is not always feasible. Wrapper methods include forward selection, backward elimination, and stepwise selection [3,4,5]. Stepwise selection combines elements of the previous two, alternating steps of adding and removing variables. ...
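The Pearson-correlation filter described in the excerpts above can be sketched as follows (pandas/NumPy assumed; the DataFrame `df` and its columns are hypothetical stand-ins, not the clinical data of the cited study):

```python
# Rank predictors by absolute Pearson correlation with the target (filter method).
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 5)), columns=[f"x{i}" for i in range(5)])
df["target"] = 2 * df["x0"] - df["x2"] + rng.normal(scale=0.5, size=200)  # synthetic target

corr = df.drop(columns="target").corrwith(df["target"], method="pearson")
ranking = corr.abs().sort_values(ascending=False)   # features ordered by |r|
print(ranking)
top_features = ranking.head(2).index.tolist()       # keep the strongest predictors
```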
Article
Full-text available
The article describes a technology for building a regression model to predict best-corrected visual acuity from medical examination indicators of patients with non-proliferative diabetic retinopathy. In the first stage, the technology involved pre-processing and analysis of the input data; in the subsequent stages, a multivariate linear regression model was built, with informative indicators selected by a filter method and by stepwise selection. All stages of the technology were implemented in a web application. The server side of the web application was developed with Python and the Flask library, and the client side with HTML, CSS, and JavaScript.
... Filter methods represent a category of feature selection techniques that assess the relevance of each feature individually. Notable examples include statistical tests like chi-squared tests, mutual information, and correlation-based feature selection (CFS) [47], [48]. These methods generate a ranked list of features based on their individual significance, making them particularly suitable for datasets with a high number of features. ...
... Embedded methods integrate feature selection within the classifier algorithm itself during training [47]. Examples include decision tree-based algorithms (e.g., decision tree, random forest, gradient boosting) and regularisation models like LASSO or elastic net. ...
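A minimal sketch of the embedded methods named above, assuming scikit-learn and synthetic data: LASSO performs selection inside its training objective, while a tree ensemble exposes feature importances that can be thresholded.

```python
# Embedded feature selection: L1-regularized regression and random-forest importances.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LassoCV

X, y = make_regression(n_samples=300, n_features=20, n_informative=4, random_state=0)

lasso = LassoCV(cv=5).fit(X, y)
kept_by_lasso = np.flatnonzero(lasso.coef_)               # features with non-zero coefficients

forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
kept_by_forest = np.argsort(forest.feature_importances_)[::-1][:4]   # top-4 by importance

print("LASSO keeps:", kept_by_lasso)
print("Forest top-4:", kept_by_forest)
```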
Article
Full-text available
Educational Data Mining (EDM) is used to ameliorate the teaching and learning process by analyzing and classifying data that can be applied to predict the students’ academic performance, and students’ dropout rate, as well as instructors’ performance. The prediction of student performance is complicated by the vast and diverse range of variables from academic records to behavioral and health metrics. In this paper, we have introduced a new Adaptive Feature Selection Algorithm (AFSA) by amalgamating an ensemble approach for initial feature ranking with normalized mean ranking from five distinct methods to enhance robustness. The proposed method iteratively selects the best features by adjusting its threshold based on each feature’s rank to ensure significant contributions to model accuracy and also effectively reduces dataset complexity. We have tested the performance of the proposed feature selection algorithm using five machine learning classifiers: Logistic Regression (LR), K-Nearest Neighbour (KNN), Support Vector Machine (SVM), Naïve Bayes (NB) classifier, and Decision Tree (DT) classifier on four student performance datasets. The experimental results highlight the proposed method significantly decreases feature count by an average feature reduction factor of 5.7, significantly streamlining datasets while maintaining competitive cross-validation accuracy, marking it as a valuable tool in the field of educational data analytics.
... Many classification algorithms are fairly robust to the inclusion of potentially noisy or irrelevant features, and their predictive power may or may not be severely affected; however, reducing the number of features often improves the model's predictive power for hold-out data. A reduced feature subset also facilitates inference, enabling one to gain insights into the problem via analysis of the most predictive features [22], [23]. ...
... Exhaustive search through all possible feature subsets is computationally intractable, a problem which has led to the development of feature selection (FS) algorithms which offer a rapid, principled approach to reduction of the number of features. FS is a topic of extensive research, and we refer to Guyon et al. [23] for further details. ...
Article
Full-text available
There has been considerable recent research into the connection between Parkinson's disease (PD) and speech impairment. Recently, a wide range of speech signal processing algorithms (dysphonia measures) aiming to predict PD symptom severity using speech signals have been introduced. In this paper, we test how accurately these novel algorithms can be used to discriminate PD subjects from healthy controls. In total, we compute 132 dysphonia measures from sustained vowels. Then, we select four parsimonious subsets of these dysphonia measures using four feature selection algorithms, and map these feature subsets to a binary classification response using two statistical classifiers: random forests and support vector machines. We use an existing database consisting of 263 samples from 43 subjects, and demonstrate that these new dysphonia measures can outperform state-of-the-art results, reaching almost 99% overall classification accuracy using only ten dysphonia features. We find that some of the recently proposed dysphonia measures complement existing algorithms in maximizing the ability of the classifiers to discriminate healthy controls from PD subjects. We see these results as an important step toward noninvasive diagnostic decision support in PD. Index Terms: Decision support tool, feature selection (FS), Parkinson's disease (PD), nonlinear speech signal processing, random forests (RF), support vector machines (SVM).
... In machine learning, feature extraction and selection are two crucial steps that reduce the dimensionality of raw data and eliminate redundant information (Guyon et al., 2008). As the selected features simplify the data and better reflect its essence, models are simplified, and the required data volume is consequently reduced. ...
... To optimize model performance and enhance interpretability, we employed Recursive Feature Elimination with Cross-Validation (RFECV) 14 with a Random Forest estimator for feature selection. This robust approach iteratively removes the least informative feature while ensuring generalizability through k-fold cross-validation. ...
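The RFECV step described above could look roughly like the following sketch (scikit-learn's RFECV with a random-forest estimator; the data here are synthetic, not the cited study's):

```python
# Recursive feature elimination with cross-validation (RFECV) wrapped around a random forest.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=400, n_features=25, n_informative=6, random_state=0)

selector = RFECV(
    estimator=RandomForestClassifier(n_estimators=100, random_state=0),
    step=1,                    # drop one feature per iteration
    cv=StratifiedKFold(5),     # k-fold CV guards generalizability
    scoring="accuracy",
)
selector.fit(X, y)
print("Optimal number of features:", selector.n_features_)
print("Selected feature indices:", selector.get_support(indices=True))
```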
Article
Full-text available
Objective This study aims to create a robust and interpretable method for predicting dementia in Parkinson's disease (PD), especially in resource-limited settings. The model aims to be accurate even with small datasets and missing values, ultimately promoting its use in clinical practice to benefit patients and medical professionals. Methods Our study introduces LightGBM–TabPFN, a novel hybrid model for predicting dementia conversion in PD. Combining LightGBM's strength in handling missing values with TabPFN's ability to exploit small datasets, LightGBM–TabPFN outperforms seven existing methods, achieving outstanding accuracy and interpretability thanks to SHAP analysis. This analysis leverages data from 242 PD patients across 17 variables. Results Our LightGBM–TabPFN model significantly outperformed seven existing methods, achieving an accuracy of 0.9592 and an area under the ROC curve of 0.9737. Conclusions The interpretable LightGBM–TabPFN with SHAP signifies a significant advancement in predictive modeling for neurodegenerative diseases. This study not only improves dementia prediction in PD but also provides clinical professionals with insights into model predictions, offering opportunities for application in clinical settings.
... This involved cleaning the datasets to remove noise, handling missing values, and standardizing formats. Additionally, feature engineering was performed to extract meaningful molecular descriptors, ensuring the representation of relevant chemical and biochemical information [11]. ...
Article
Full-text available
Artificial intelligence (AI) has emerged as a transformative force in drug discovery, revolutionizing traditional approaches in chemical and biochemical sciences. This paper explores the significance, benefits, and limitations of AI in the context of drug discovery, emphasizing its role in accelerating the identification of therapeutic candidates and optimizing existing drugs. Leveraging diverse sets of chemical and biochemical data, sourced from reputable databases and literature, the study employs advanced machine learning and deep learning algorithms for predictive modeling. Key AI-driven outcomes include target identification and validation, virtual screening results, molecular docking scores, compound design, optimization, and high-throughput screening automation. The findings showcase superior performance compared to traditional methods, emphasizing the efficiency and accuracy of AI-driven drug discovery. However, challenges such as data quality and ethical considerations underscore the need for ongoing research and development. The paper concludes with insights into collaborative opportunities and areas for further development, highlighting AI's potential impact on personalized medicine and its integration into drug development pipelines. Two key themes, AI and drug discovery, encapsulate the essence of this comprehensive exploration into the current state and future directions of AI in the pharmaceutical domain. Keywords: Artificial Intelligence, Drug Discovery, Predictive Modeling, Machine Learning, Deep Learning, High-Throughput Screening, Molecular Docking
... Feature extraction is used to reduce datasets to their informative and non-redundant parts; for more details, we refer to [64]. Such methods usually involve some form of dimensionality reduction, and common techniques of this kind are principal component analysis (PCA) [65], autoencoders [66], and clustering techniques such as k-means clustering. ...
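The two dimensionality-reduction routes mentioned above can be sketched as follows (scikit-learn assumed; the digits dataset is only a convenient stand-in): a PCA projection and k-means cluster assignments used as compact features.

```python
# Feature extraction by projection (PCA) and by clustering (k-means labels as features).
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)            # 64-dimensional inputs

X_pca = PCA(n_components=10, random_state=0).fit_transform(X)
print("PCA-reduced shape:", X_pca.shape)       # (n_samples, 10)

km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X)
cluster_ids = km.labels_                       # one categorical feature per sample
```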
Preprint
Full-text available
This review article highlights state-of-the-art data-driven techniques to discover, encode, surrogate, or emulate constitutive laws that describe the path-independent and path-dependent response of solids. Our objective is to provide an organized taxonomy to a large spectrum of methodologies developed in the past decades and to discuss the benefits and drawbacks of the various techniques for interpreting and forecasting mechanics behavior across different scales. Distinguishing between machine-learning-based and model-free methods, we further categorize approaches based on their interpretability and on their learning process/type of required data, while discussing the key problems of generalization and trustworthiness. We attempt to provide a road map of how these can be reconciled in a data-availability-aware context. We also touch upon relevant aspects such as data sampling techniques, design of experiments, verification, and validation.
... Feature engineering is a critical aspect of the success of an ML project (Domingos, 2012) because it helps to reduce the training cost and improves prediction accuracy (Kuhn and Johnson, 2019). The two main steps of feature engineering are feature selection (Cai et al., 2018) and feature extraction (Guyon et al., 2008). Feature selection allows reducing the number of features by keeping only the relevant ones. ...
... neighborhood size (k), which defines the neighborhood for local density calculation, and contamination (c), which specifies the proportion of outliers in the dataset (Breunig et al., 2000; Chandola, Banerjee, and Kumar, 2009). ... Feature selection methods can be divided into three groups: filter, wrapper, and embedded methods (Guyon et al., 2008; Chandrashekar and Sahin, 2014; Jović, Brkić, and Bogunović, 2015; Brownlee, 2016a). ...
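A minimal sketch of the Local Outlier Factor step referenced above, with the two hyperparameters named in the excerpt; scikit-learn assumed, and the two-dimensional data are synthetic:

```python
# Local Outlier Factor with neighborhood size k and contamination c.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(200, 2)), rng.normal(loc=6.0, size=(5, 2))])  # 5 planted outliers

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.03)   # k = 20, c = 3 %
labels = lof.fit_predict(X)                                    # -1 marks outliers, +1 inliers
print("Flagged outliers:", np.flatnonzero(labels == -1))
```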
Thesis
Full-text available
The number of extrasolar planets discovered is increasing, so that more than five thousand exoplanets have been confirmed to date. Now we have an opportunity to test the validity of the laws governing planetary systems and take steps to discover the relationships between the physical parameters of planets and stars. Firstly, we present the results of a search for additional exoplanets in 229 multi-planetary systems that house at least three or more confirmed planets, employing a logarithmic spacing between planets in our Solar System known as the Titius-Bode (TB) relation. We find that the planets in ∼53% of these systems adhere to a logarithmic spacing relation remarkably better than the Solar System planets. We predict the presence of 426 additional exoplanets, 47 of which are located within the habitable zone (HZ), and five of the 47 planets have a maximum mass limit of 0.1-2 M⊕ and a maximum radius lower than 1.25 R⊕. Secondly, we employ efficient machine learning approaches to analyze a dataset comprising 762 confirmed exoplanets and eight Solar System planets, aiming to characterize their fundamental quantities. We classify the data into two main classes: 'small' and 'giant' planets, with cut-off values at Rp=8.13R⊕ and Mp=52.48M⊕. Giant planets have lower densities, suggesting higher H-He mass fractions, while small planets are denser, composed mainly of heavier elements. We highlight that planetary mass, orbital period, and stellar mass play crucial roles in predicting exoplanet radius. Notably, our study reveals a noteworthy result: for giant planets, we observe a strong correlation between planetary radius and the mass of their host stars, which might provide intriguing insights into the relationship between giant planet formation and stellar characteristics.
... Feature selection methods play a vital role in biomedical data analysis, helping to identify the most relevant features contributing to a health outcome while eliminating noise, redundancy, and irrelevant factors [5][6][7]. Biomedical datasets collected from human biosamples often contain many features, some of which may be irrelevant to the outcome of interest. Analysing all features can lead to overfitting, reduced accuracy, and a less concise understanding of the underlying biological processes [5,8,9]. ...
Article
Full-text available
Early diagnosis of dementia diseases, such as Alzheimer's disease, is difficult because of the time and resources needed to perform neuropsychological and pathological assessments. Given the increasing use of machine learning methods to evaluate neuropathology features in the brains of dementia patients, it is important to investigate how the selection of features may be impacted and which features are most important for the classification of dementia. We objectively assessed neuropathology features using machine learning techniques for filtering features in two independent ageing cohorts, the Cognitive Function and Aging Studies (CFAS) and Alzheimer's Disease Neuroimaging Initiative (ADNI). The reliefF and least loss methods were most consistent with their rankings between ADNI and CFAS; however, reliefF was most biassed by feature–feature correlations. Braak stage was consistently the highest ranked feature and its ranking was not correlated with other features, highlighting its unique importance. Using a smaller set of highly ranked features, rather than all features, can achieve a similar or better dementia classification performance in CFAS (60%–70% accuracy with Naïve Bayes). This study showed that specific neuropathology features can be prioritised by feature filtering methods, but they are impacted by feature–feature correlations and their results can vary between cohort studies. By understanding these biases, we can reduce discrepancies in feature ranking and identify a minimal set of features needed for accurate classification of dementia.
... More often than not, transformation is done to raw data before it is fed into a model to ensure the desired output. This is referred to as feature transformation or engineering (Guyon et al., 2008). Signal processing methods have been known to "break down" signals into their constituent parts, such that valuable insights can be gleaned. ...
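The kind of signal decomposition alluded to above can be sketched with a wavelet transform (PyWavelets assumed; the series below is a synthetic stand-in for a wind-speed signal, not data from the cited work):

```python
# Wavelet decomposition of a raw series into multi-resolution coefficients used as features.
import numpy as np
import pywt

rng = np.random.default_rng(0)
t = np.linspace(0, 10, 1024)
signal = np.sin(2 * np.pi * 0.5 * t) + 0.3 * rng.normal(size=t.size)   # noisy stand-in series

coeffs = pywt.wavedec(signal, wavelet="db4", level=3)    # [cA3, cD3, cD2, cD1]
features = np.concatenate([c[:10] for c in coeffs])      # crude fixed-length feature vector
print(len(coeffs), "coefficient bands ->", features.shape, "features")
```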
Article
Full-text available
The forecast of wind speed and the power produced from wind farms has been a challenge for a long time and continues to be so. This work introduces a method that we label as Wavelet Decomposition-Neural Networks (WDNN) that combines wavelet decomposition principles and deep learning. By merging the strengths of signal processing and machine learning, this approach aims to address the aforementioned challenge. Treating wind speed and power as signals, the wavelet decomposition part of the model transforms these inputs, as appropriate, into a set of features that the neural network part of the model can ingest to output accurate forecasts. WDNN is unconstrained by the shape, layout, or number of turbines in a wind farm. We test our WDNN methods using three large datasets, with multiple years of data and hundreds of turbines, and compare it against other state-of-the-art methods. Its very short-term forecasts, such as 1-h ahead, can outperform some deep learning models by as much as 30%. This shows that wavelet decomposition and neural networks are a potent combination for advancing the quality of short-term wind forecasting.
... Feature selection [33] is vital for increasing classifier accuracy, saving data-collecting effort, improving model interpretability, and shortening prediction time. Feature importance scores are crucial in a predictive modeling task because they give an understanding of the data and the models [34]. ...
Article
Full-text available
One of the most significant research areas in education and Artificial Intelligence (AI) is the earlier prediction of students’ academic achievement. Limited studies have been conducted using Deep Learning (DL) in the student domain of Intelligent Tutoring System (ITS). Traditional Machine Learning (ML) techniques have been employed in many earlier publications to predict student performance. This paper investigates the effectiveness of DL algorithms for predicting student academic performance. Three different DL architectures based on the structure of Convolutional Neural Networks (CNN) are presented. Two public datasets are used. Furthermore, two feature selection techniques are utilized in this experiment: Principal Component Analysis (PCA) and Decision Trees (DTs). Also, we applied a resampling technique for the first dataset to address the issue of an imbalanced dataset. According to the experimental findings, the proposed CNN model’s success in predicting student performance at early stages reached an accuracy of 94.36% using the first dataset and 84.83% using the second dataset. Comparing the proposed approach with the previous studies, the proposed approach outperformed all previous studies when dataset 2 and part of dataset 1 were used. For the complete dataset 1, the proposed model performed very well.
... While CMIM is highly effective in predicting class labels, it does not consider selecting attributes that are similar to the pre-selected attributes. In another information-theoretic method, Double Input Symmetrical Relevance (DISR), normalization is used to normalize mutual information; it has been stated that DISR is a nonlinear combination of Shannon information terms and is defined as conditional likelihood maximization (Meyer and Bontempi 2006; Guyon et al. 2006). Fast correlation-based filter (FCBF) is a feature selection method that uses feature-class correlation and feature-feature relationships in combination, and it is not amenable to being expressed as a combined conditional likelihood maximization method based on different information theory (Yu and Liu 2003). ...
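Only the simplest information-theoretic filter is sketched below, a univariate mutual-information ranking (scikit-learn assumed, synthetic data); CMIM, DISR, and FCBF refine this idea by also penalizing redundancy or conditioning on already selected features.

```python
# Univariate mutual-information ranking, the building block behind CMIM/DISR/FCBF-style filters.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=500, n_features=15, n_informative=4, random_state=0)
mi = mutual_info_classif(X, y, random_state=0)
ranking = np.argsort(mi)[::-1]
print("Features ranked by mutual information:", ranking)
```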
Article
Full-text available
Feature selection is an important factor of accurately classifying high dimensional data sets by identifying relevant features and improving classification accuracy. The use of feature selection in operations research allows for the identification of relevant features and the creation of optimal subsets of features for improved predictive performance. This paper proposes a novel feature selection algorithm inspired from ensemble pruning which involves the use of second-order conic programming modeled as an embedded feature selection technique with neural networks, named feature selection via second order cone programming (FSOCP). The proposed FSOCP algorithm trains features individually on a neural network and generates a probability class distribution and prediction, allowing the second-order conic programming model to determine the most important features for improved classification accuracies. The algorithm is evaluated on multiple synthetic data sets and compared with other feature selection techniques, demonstrating its promising potential as a feature selection approach.
... Feature Extraction: The feature extraction task involves identifying a concise and meaningful collection of features, enhancing data efficiency, and facilitating storage and processing [45]. Different feature extraction methods are used for image, video, or textual data analysis. ...
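For textual data, one simple extraction route is a TF-IDF representation; the sketch below (scikit-learn assumed) uses hypothetical customer snippets, not data from the cited work:

```python
# TF-IDF feature extraction for short customer texts.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "service was slow and I am thinking of cancelling",
    "great support, very happy with the product",
    "billing issue again, considering another provider",
]
vec = TfidfVectorizer(stop_words="english", max_features=50)
X_text = vec.fit_transform(docs)               # sparse (n_docs, n_terms) matrix
print(X_text.shape, vec.get_feature_names_out()[:10])
```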
Preprint
Full-text available
Customers are the most critical component in a business’s success regardless of the industry or product. Companies make significant efforts to acquire and, more importantly, retain their existing customers. Customer churn is a significant challenge for businesses, leading to financial losses. To address this challenge, understanding customer’s cognitive status, behaviors, and early signs of churn is crucial. However, predictive and ML-based analysis, being fed with proper features that are indicative of a customer’s cognitive status or behavior, is extremely helpful in addressing this challenge. Having practical ML-based analysis relies on a well-developed feature engineering process. Previous churn analytical studies mainly applied feature engineering approaches that leveraged demographic, product usage, and revenue features alone, and there is a lack of research on leveraging the information-rich content from interactions between customers and companies. Considering the effectiveness of applying domain knowledge and human expertise in feature engineering, and motivated by our previous work, we propose a Customer Churn-related Knowledge Base (ChurnKB) to enhance the feature engineering process. In the ChurnKB, we leverage textual data mining techniques for extracting churn-related features from texts created by customers, e.g., emails or chat logs with company agents, reviews on the company’s website, and feedback on social media. We use Generative AI (GAI) to enhance and enrich the structure of the ChurnKB regarding features related to customer churn-related cognitive status, feelings, and behaviors. We also leveraged feedback loops and crowdsourcing to enhance and approve the validity of the proposed ChurnKB and apply it to develop a classifier for customer churn problems.
... We start by getting data from Android apps with a dataset, such as Drebin or CICAndMal2017. Next, data pre-processing techniques, namely, techniques to handle missing values, for numerosity balancing and feature selection [63,64], are applied to properly prepare the data and to assess their impact on the model's performance. Additionally, a set of the most relevant features will be obtained with a feature selection technique. ...
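A hedged sketch of the pre-processing chain described above, assuming scikit-learn and synthetic stand-in data rather than Drebin or CICAndMal2017: imputation of missing values, scaling, feature selection, and a classifier combined in one pipeline.

```python
# Missing-value imputation + feature selection + SVM, chained in a single pipeline.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=40, n_informative=8, random_state=0)
X[np.random.default_rng(0).random(X.shape) < 0.05] = np.nan   # inject 5 % missing values

clf = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", MinMaxScaler()),              # chi2 requires non-negative inputs
    ("select", SelectKBest(chi2, k=15)),
    ("svm", SVC(kernel="rbf")),
]).fit(X, y)
print("Training accuracy:", clf.score(X, y))
```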
Article
Full-text available
The presence of malicious software (malware), for example, in Android applications (apps), has harmful or irreparable consequences to the user and/or the device. Despite the protections app stores provide to avoid malware, it keeps growing in sophistication and diffusion. In this paper, we explore the use of machine learning (ML) techniques to detect malware in Android apps. The focus is on the study of different data pre-processing, dimensionality reduction, and classification techniques, assessing the generalization ability of the learned models using public domain datasets and specifically developed apps. We find that the classifiers that achieve better performance for this task are support vector machines (SVM) and random forests (RF). We emphasize the use of feature selection (FS) techniques to reduce the data dimensionality and to identify the most relevant features in Android malware classification, leading to explainability on this task. Our approach can identify the most relevant features to classify an app as malware. Namely, we conclude that permissions play a prominent role in Android malware detection. The proposed approach reduces the data dimensionality while achieving high accuracy in identifying malware in Android apps.
... There are several books discussing FE and FS, such as [24–27]. ...
Article
Full-text available
This paper discusses the critical decision process of extracting or selecting the features in a supervised learning context. It is often confusing to find a suitable method to reduce dimensionality. There are pros and cons to deciding between a feature selection and feature extraction according to the data’s nature and the user’s preferences. Indeed, the user may want to emphasize the results toward integrity or interpretability and a specific data resolution. This paper proposes a new method to choose the best dimensionality reduction method in a supervised learning context. It also helps to drop or reconstruct the features until a target resolution is reached. This target resolution can be user defined, or it can be automatically defined by the method. The method applies a regression or a classification, evaluates the results, and gives a diagnosis about the best dimensionality reduction process in this specific supervised learning context. The main algorithms used are the random forest algorithms, the principal component analysis algorithm, and the multilayer perceptron neural network algorithm. Six use cases are presented, and every one is based on some well-known technique to generate synthetic data. This research also discusses each choice that can be made in the process, aiming to clarify the issues about the entire decision process of selecting or extracting the features.
... At the same time, numerous methods have been proposed to cope with such noisy and irrelevant features, stemming from feature selection (Guyon et al., 2006;Chandrashekar & Sahin, 2014) or metric learning (Kulis, 2013;Wang & Sun, 2015) techniques. Filter methods to feature selection are a popular candidate since they are fast to compute. ...
Article
Full-text available
Symbolic event recognition systems detect event occurrences using first-order logic rules. Although existing online structure learning approaches ease the discovery of such rules in noisy data streams, they assume the existence of fully labelled training data. Splice is a recent online graph-based approach that estimates the labels of unlabelled data and makes it possible to learn such rules from semi-supervised training sequences of logical interpretations. However, Splice labelling depends significantly on the metric used to compute the distances of unlabelled examples to their labelled counterparts. Moreover, there is no guarantee about the quality of the labelling found in the local graphs that are built as the data stream in. In this paper, we propose a new online learning method, which includes an enhanced hybrid measure that combines an optimised structural distance, and a data-driven one. The former is guided by feature selection targeted to kNN classification, while the latter is a mass-based dissimilarity. Additionally, the enhanced Splice method, improves the graph construction process, by storing a synopsis of the past, in order to achieve more informed labelling on the local graphs. We evaluate our approach by learning Event Calculus theories for the tasks of human activity recognition, maritime monitoring, and fleet management. The evaluation suggests that our approach outperforms its predecessor, in terms of inferring the missing labels and improving the predictive accuracy of the underlying structure learning system.
... Feature selection methods can be roughly classified into five groups: wrapper, filtering, embedded, ensemble, and hybrid models [4][5][6][7][8][9][10]. Filtering methods use an indirect criterion to measure the success of the predictions and rely on statistical approaches. ...
... However, high-dimensional data has become a real challenge for predictive tasks in DM and ML algorithms because the data may come from multiple sources, which in turn affects performance, is computationally intensive, and introduces the problem of overfitting. Therefore, in order to remove irrelevant and redundant features, feature selection is a necessary preprocessing step of the classification process, which serves to reduce computation time and improve learning accuracy, especially for high-dimensional datasets [49] [50]. ...
Article
Full-text available
In a competitive digital age where data volumes are increasing with time, the ability to extract meaningful knowledge from high-dimensional data using machine learning (ML) and data mining (DM) techniques and making decisions based on the extracted knowledge is becoming increasingly important in all business domains. Nevertheless, high-dimensional data remains a major challenge for classification algorithms due to its high computational cost and storage requirements. The 2016 Demographic and Health Survey of Ethiopia (EDHS 2016) used as the data source for this study which is publicly available contains several features that may not be relevant to the prediction task. In this paper, we developed a hybrid multidimensional metrics framework for predictive modeling for both model performance evaluation and feature selection to overcome the feature selection challenges and select the best model among the available models in DM and ML. The proposed hybrid metrics were used to measure the efficiency of the predictive models. Experimental results show that the decision tree algorithm is the most efficient model. The higher HMM (m, r) = 0.47 score illustrates the overall significant model that encompasses almost all the user’s requirements, unlike the classical metrics that use a criterion to select the most appropriate model. On the other hand, the ANNs were found to be the most computationally intensive for our prediction task. Moreover, the type of data and the class size of the dataset (unbalanced data) have a significant impact on the model's efficiency, especially on the computational cost, and the interpretability of the model's parameters would be hampered. The efficiency of the predictive model could be improved with other feature selection algorithms (especially hybrid metrics) considering the experts of the knowledge domain, as the understanding of the business domain has a significant impact. Keywords Predictive Modeling, Hybrid Metrics, Feature Selection, Model Selection, Algorithm Analysis, Machine Learning
... However, for these types of models, the analysis usually focuses on prediction accuracy rather than the interpretability of the model (Feurer and Hutter, 2019). To implement interpretability, dimensionality reduction can be introduced for supervised and unsupervised (Azencott, 2018) problems through feature selection and feature extraction (Dy et al., 2000;Guyon and Elisseeff, 2003;Guyon et al., 2008). In addition, Vellido (Alcacena et al., 2011) stated that information visualization is a feasible solution to interpret the machine learning models such as Partial Dependency Plots (PDP) (Greenwell, 2017) and Shapley Additive explanation (SHAP) (Mangalathu et al., 2020). ...
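One of the visualization aids named above, the partial dependence plot, can be sketched with scikit-learn's inspection module (synthetic data with hypothetical column names; SHAP would require the separate shap package and is omitted here):

```python
# Partial dependence plot for a gradient-boosting model on synthetic data.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import PartialDependenceDisplay

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 4)), columns=["flow", "temp_in", "temp_out", "fouling"])
y = 3 * X["flow"] + np.sin(X["temp_in"]) + 0.5 * rng.normal(size=500)

model = GradientBoostingRegressor(random_state=0).fit(X, y)
PartialDependenceDisplay.from_estimator(model, X, features=["flow", "temp_in"])
plt.tight_layout()
plt.show()
```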
Article
Full-text available
Heat exchanger modeling has been widely employed in recent years for performance calculation, design optimizations, real-time simulations for control analysis, as well as transient performance predictions. Among these applications, the model’s computational speed and robustness are of great interest, particularly for the purpose of optimization studies. Machine learning models built upon experimental or numerical data can contribute to improving the state-of-the-art simulation approaches, provided careful consideration is given to algorithm selection and implementation, to the quality of the database, and to the input parameters and variables. This comprehensive review covers machine learning methods applied to heat exchanger applications in the last 8 years. The reviews are generally categorized based on the types of heat exchangers and also consider common factors of concern, such as fouling, thermodynamic properties, and flow regimes. In addition, the limitations of machine learning methods for heat exchanger modeling and potential solutions are discussed, along with an analysis of emerging trends. As a regression classification tool, machine learning is an attractive data-driven method to estimate heat exchanger parameters, showing a promising prediction capability. Based on this review article, researchers can choose appropriate models for analyzing and improving heat exchanger modeling.
... Wrapper, filter, and embedded methods are the three techniques to select features. 28 Since the embedded technique combines the advantages of both the filter method and the wrapper method, it serves as a middle ground solution. 29 Particularly, the embedded technique is computationally simpler than the wrapper method while remaining computationally intensive compared to the filter method. ...
Article
Full-text available
Objective The Eastern Cooperative Oncology Group performance status (ECOG PS) is a widely recognized measure used to assess the functional abilities of cancer patients and predict their prognosis. It plays a crucial role in guiding treatment decisions made by physicians. This study aimed to build a stacking ensemble-based prognosis predictor model for predicting the ECOG PS of a liver cancer patient undergoing treatment. Methods We used Light Gradient Boosting Machine (LightGBM) as the meta-model, and five base models, including Random Forest (RF), Extra Trees (ET), AdaBoost (Ada), Gradient Boosting Machine (GBM), and Extreme Gradient Boosting (XGBoost). After preprocessing the data and applying feature selection method, the stacking ensemble model was trained using 1622 liver cancer patients’ data and 46 variables. We also integrated the stacking ensemble model with a LIME-based explainable model to obtain model prediction explainability. Results According to the research, the best combination of the stacking ensemble model is ET + XGBoost + RF + GBM + Ada + LightGBM and achieved a ROC AUC of 0.9826 on the training set and 0.9675 on the test set. Conclusions This explainable stacking ensemble model can become a helpful tool for objectively predicting ECOG PS in liver cancer patients and aiding healthcare practitioners to adapt their treatment approach more effectively.
... In data-driven medical diagnosis, it is also crucial to automatically pick out the major risk factors for certain disease among a large number of candidate indicators [39]. To address this issue, many variable selection techniques have been utilized to select the most relevant features, enhance model interpretability and avoid overfitting [40]. In principle, exhaustive searching of all possible combinations of variables is an ideal way for selecting the best subset. ...
Article
Full-text available
Background Data loss often occurs in the collection of clinical data. Directly discarding the incomplete sample may lead to low accuracy of medical diagnosis. A suitable data imputation method can help researchers make better use of valuable medical data. Methods In this paper, five popular imputation methods including mean imputation, expectation-maximization (EM) imputation, K-nearest neighbors (KNN) imputation, denoising autoencoders (DAE) and generative adversarial imputation nets (GAIN) are employed on an incomplete clinical dataset with 28,274 cases for vaginal prolapse prediction. A comprehensive comparison study for the performance of these methods has been conducted through certain classification criteria. It is shown that the prediction accuracy can be greatly improved by using the imputed data, especially by GAIN. To find out the important risk factors to this disease among a large number of candidate features, three variable selection methods: the least absolute shrinkage and selection operator (LASSO), the smoothly clipped absolute deviation (SCAD) and the broken adaptive ridge (BAR) are implemented in logistic regression for feature selection on the imputed datasets. In pursuit of our primary objective, which is accurate diagnosis, we employed diagnostic accuracy (classification accuracy) as a pivotal metric to assess both imputation and feature selection techniques. This assessment encompassed seven classifiers (logistic regression (LR) classifier, random forest (RF) classifier, support vector classifier (SVC), extreme gradient boosting (XGBoost), LASSO classifier, SCAD classifier and Elastic Net classifier), enhancing the comprehensiveness of our evaluation. Results The proposed framework imputation-variable selection-prediction is quite suitable to the collected vaginal prolapse datasets. It is observed that the original dataset is well imputed by GAIN first, and then the 9 most significant features were selected using BAR from the original 67 features in the GAIN-imputed dataset, with only negligible loss in model prediction. BAR is superior to the other two variable selection methods in our tests. Conclusions Overall, combining imputation, classification and variable selection, we achieve good interpretability while maintaining high accuracy in computer-aided medical diagnosis.
... In the classification literature, feature selection approaches have garnered a lot of attention, and they can be divided into three categories based on their interaction with the induction algorithm (Guyon et al., 2008;Shahrjooihaghighi & Frigui, 2021): filters, wrappers, and embedding methods. We choose filter methods over wrapper and embedded methods because we want to avoid the interaction with the classifier. ...
Article
Full-text available
The growth of Big Data has resulted in an overwhelming increase in the volume of data available, including the number of features. Feature selection, the process of selecting relevant features and discarding irrelevant ones, has been successfully used to reduce the dimensionality of datasets. However, with numerous feature selection approaches in the literature, determining the best strategy for a specific problem is not straightforward. In this study, we compare the performance of various feature selection approaches to a random selection to identify the most effective strategy for a given type of problem. We use a large number of datasets to cover a broad range of real-world challenges. We evaluate the performance of seven popular feature selection approaches and five classifiers. Our findings show that feature selection is a valuable tool in machine learning and that correlation-based feature selection is the most effective strategy regardless of the scenario. Additionally, we found that using improper thresholds with ranker approaches produces results as poor as randomly selecting a subset of features.
... In 2003, NIPS hosted the feature selection challenge [26]. It consisted of five binary classification problems which are further explained and discussed in [28]. The challenge was the classification of high dimensional data with a limited amount of training data being available. ...
Article
Full-text available
The literature has shown how to optimize and analyze the parameters of different types of neural networks using mixed integer linear programs (MILP). Building on these developments, this work presents an approach to do so for McCulloch/Pitts and Rosenblatt neurons. As the original formulation involves a step-function, it is not differentiable, but it is possible to optimize the parameters of neurons, and their concatenation as a shallow neural network, by using a mixed integer linear program. The main contribution of this paper is to additionally enforce sparsity constraints on the weights and activations as well as on the amount of used neurons. Several experiments demonstrate that such constraints effectively prevent overfitting in neural networks, and ensure resource optimized models.
... Therefore, when confronted with datasets abundant in explanatory variables, various methods of integrating or transforming variable information become worth considering. Techniques such as feature selection [44,45], feature extraction [46,47], weighting variables [48,49], regularization [50], and split-and-merge [51] approaches can be integrated. Incorporating these into our proposed BLogic algorithm may set the stage for a more streamlined inclusion of genuinely pivotal explanatory variables into the logic regression model, concluding our quest for enhanced predictability and interpretability in this field. ...
Article
Full-text available
With the increasing complexity and dimensionality of datasets in statistical research, traditional methods of identifying interactions are often more challenging to apply due to the limitations of model assumptions. Logic regression has emerged as an effective tool, leveraging Boolean combinations of binary explanatory variables. However, the prevalent simulated annealing approach in logic regression sometimes faces stability issues. This study introduces the BLogic algorithm, a novel approach that amalgamates multiple runs of simulated annealing on a dataset and synthesizes the results via the Bayesian model combination technique. This algorithm not only facilitates predicting response variables using binary explanatory ones but also offers a score computation for prime implicants, elucidating key variables and their interactions within the data. In simulations with identical parameters, conventional logic regression, when executed with a single instance of simulated annealing, exhibits reduced predictive and interpretative capabilities as soon as the ratio of explanatory variables to sample size surpasses 10. In contrast, the BLogic algorithm maintains its effectiveness until this ratio approaches 50. This underscores its heightened resilience against challenges in high-dimensional settings, especially the large p, small n problem. Moreover, employing real-world data from the UK10K Project, we also showcase the practical performance of the BLogic algorithm.
... The second approach uses the data to perform the feature selection. The algorithms under this approach are broadly classified into filter, embedded and wrapper algorithms [6][7][8] and could be used in supervised, semi-supervised or unsupervised learning frameworks [8][9][10]. Filter algorithms rely on the internal data structure of the features for selecting features. ...
Article
Full-text available
Background Feature selection is important in high dimensional data analysis. The wrapper approach is one of the ways to perform feature selection, but it is computationally intensive as it builds and evaluates models of multiple subsets of features. The existing wrapper algorithm primarily focuses on shortening the path to find an optimal feature set. However, it underutilizes the capability of feature subset models, which impacts feature selection and its predictive performance. Method and Results This study proposes a novel Artificial Intelligence based Wrapper (AIWrap) algorithm that integrates Artificial Intelligence (AI) with the existing wrapper algorithm. The algorithm develops a Performance Prediction Model using AI which predicts the model performance of any feature set and allows the wrapper algorithm to evaluate the feature subset performance in a model without building the model. The algorithm can make the wrapper algorithm more relevant for high-dimensional data. We evaluate the performance of this algorithm using simulated studies and real research studies. AIWrap shows feature selection and model prediction performance that is better than or on par with standard penalized feature selection algorithms and wrapper algorithms. Conclusion The AIWrap approach provides an alternative algorithm to the existing algorithms for feature selection. The current study focuses on AIWrap application in continuous cross-sectional data. However, it could be applied to other datasets like longitudinal, categorical and time-to-event biological data.
... Advanced random forest classifiers, including AL, applied to CASAS data in §3.3, use feature extraction as a preliminary step. This involves selecting optimal features to improve classification performance, such as normalising or scaling sensor output values [34]. Another technique often used is sliding windows, where sensor readings are taken in batches (windows) instead of single readings to improve classification results. ...
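The sliding-window step described above can be sketched with plain NumPy (the sensor stream below is a synthetic stand-in, not CASAS data): readings are batched into fixed-length, overlapping windows and each window is summarized by simple statistics.

```python
# Sliding-window feature extraction over a 1-D stream of sensor readings.
import numpy as np

readings = np.random.default_rng(0).normal(size=1000)      # stand-in ambient sensor stream
window, step = 50, 25                                       # 50-sample windows, 50 % overlap

features = []
for start in range(0, readings.size - window + 1, step):
    w = readings[start:start + window]
    features.append([w.mean(), w.std(), w.min(), w.max()])  # per-window feature vector
features = np.asarray(features)
print(features.shape)                                       # (n_windows, 4)
```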
Thesis
Full-text available
With an increasingly ageing population, innovative technology is in high demand to assist the single-resident elderly population. Examples of this are long-term activity deterioration detection and finding explanations for abnormal behaviour. This leads to improved provided healthcare and enables a better understanding of what influences human behaviour. This thesis proposes multiple approaches for addressing these challenges based on related research. After applying human activity recognition techniques to label time-series ambient sensor data, insights into individuals’ long-term activities of daily living behaviour patterns and daily routines are provided. These are then used to discover changing activity levels to determine the impact of age and external factors, such as weather, leading to activity forecasting and possible long-term deterioration prediction. Center for Advanced Studies in Adaptive Systems (CASAS) datasets are used, and multiple methods are proposed, including process mining, clustering normal behaviour, applying daily score metrics, and time-series forecasting. Finally, the results are correlated with weather attributes to assess their impact and find ways to detect daily activity behaviour changes and deterioration over multiple years of sensor data.
... Including noise data at frequencies far beyond the expected ion frequency has the potential of 'confusing' classifiers, as such datapoints are 'informationless': a known concern for classifier reliability and efficiency [44,45]. Furthermore, the frequency resolution of the FFT allows signal peaks to be either clarified or obfuscated within the surrounding noise. ...
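The band-limiting idea discussed above, keeping only FFT bins near the expected ion frequency so that 'informationless' bins never reach the classifier, can be sketched as follows (NumPy assumed; the sampling rate, frequencies, and signal are illustrative, not SIPT data):

```python
# Restrict an FFT magnitude spectrum to a band around the expected line frequency.
import numpy as np

fs = 10_000.0                                    # sampling rate in Hz (assumed)
t = np.arange(0, 1.0, 1.0 / fs)
signal = np.sin(2 * np.pi * 1200.0 * t) + 0.5 * np.random.default_rng(0).normal(size=t.size)

spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(t.size, d=1.0 / fs)      # 1 Hz resolution for 1 s of data

band = (freqs > 1000.0) & (freqs < 1400.0)       # keep only bins near the expected frequency
features = spectrum[band]
print(features.shape)
```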
Article
Full-text available
The single-ion Penning trap (SIPT) at the Low-Energy Beam Ion Trapping Facility has been developed to perform precision Penning trap mass measurements of single ions, ideal for the study of exotic nuclei available only at low rates at the Facility for Rare Isotope Beams (FRIB). Single-ion signals are very weak—especially if the ion is singly charged—and the few meaningful ion signals must be disentangled from an often larger noise background. A useful approach for simulating Fourier transform ion cyclotron resonance signals is outlined and shown to be equivalent to the established yet computationally intense method. Applications of supervised machine learning algorithms for classifying background signals are discussed, and their accuracies are shown to be ≈65% for the weakest signals of interest to SIPT. Additionally, a deep neural network capable of accurately predicting important characteristics of the ions observed by their image charge signal is discussed. Signal classification on an experimental noise dataset was shown to have a false-positive classification rate of 10.5%, and 3.5% following additional filtering. The application of the deep neural network to an experimental 85Rb+ dataset is presented, suggesting that SIPT is sensitive to single-ion signals. Lastly, the implications for future experiments are discussed.
... Basically, DL inherits the fundamental function of ML, which is to evolve an AI module that can map the given inputs to corresponding desired outputs in a certain range. However, different from the usual machine learning based on hand-crafted features (Guyon et al. 2008), DL forms an end-to-end framework that is able to learn task-specifically more expressive features (Yang et al. 2023c). With the supportive foundations of big data and efficient computing resources, DL architectures can possess the excellent ability to implement the fundamental function of machine learning. ...
Preprint
Full-text available
Artificial intelligence (AI) alignment is an open and important problem within AI, and there is no solution that is able to address this problem fully. To alleviate this situation, in this article, we formally and completely demonstrated the concept of discovering scientific paradigms (SPs) for AI alignment. Primarily, we systematically established the concept of discovering SPs for AI alignment, via providing explanations for related fundamental terminologies and connection analysis. Subsequently, referring to some previously conducted works, we organized specific contents for this concept, including: 1) presenting several SPs discovered for AI alignment, 2) revealing the methodologies of discovering these presented SPs, and 3) showing the contributions of the presented SPs for AI alignment in real-world scenarios. Additionally, we further present comprehensive discussions to highlight the intrinsic properties and expected potentials of discovering SPs for AI alignment. The whole article establishes a practical knowledge system for scientifically addressing the AI alignment problem, via establishing the concept of discovering SPs for AI alignment, organizing specific contents for the concept, and discussing the intrinsic properties and expected potentials of discovering SPs for AI alignment. We hope this article can make a scientific contribution to AI alignment.
... The latter has not only provided machine learning with a wealth of opportunities but has also significantly increased the dimensionality of data. Feature selection addresses this problem by selecting the relevant features from the data while discarding irrelevant or redundant ones [1]. ...
... Feature selection [2] is a technique used in machine learning to reduce the dimensionality of a dataset, with the goal of selecting the features that provide useful information for our predictive model, and therefore reducing the amount of data used. On the other hand, transfer learning is another method that aims to make use of already learned knowledge for one domain in a different domain. ...
... Embedded methods incorporate feature selection as a part of training, while wrapper methods interact in a feedback loop with the learning model. Filter methods select a subset of features based on properties of the dataset before the model is able to learn on the dataset, which differs from embedded and wrapper methods in that they do not form a feedback loop with the model (Guyon et al. 2008). Because of their independence, they tend to generalize well (Remeseiro and Bolon-Canedo 2019). ...
Preprint
Full-text available
Survey data can contain a high number of features while having a comparatively low quantity of examples. Machine learning models that attempt to predict outcomes from survey data under these conditions can overfit and result in poor generalizability. One remedy to this issue is feature selection, which attempts to select an optimal subset of features to learn upon. A relatively unexplored source of information in the feature selection process is the usage of textual names of features, which may be semantically indicative of which features are relevant to a target outcome. The relationships between feature names and target names can be evaluated using language models (LMs) to produce semantic textual similarity (STS) scores, which can then be used to select features. We examine the performance using STS to select features directly and in the minimal-redundancy-maximal-relevance (mRMR) algorithm. The performance of STS as a feature selection metric is evaluated against preliminary survey data collected as a part of a clinical study on persistent post-surgical pain (PPSP). The results suggest that features selected with STS can result in higher performance models compared to traditional feature selection algorithms.
Article
Heart disease causes significant mortality worldwide and has become a health threat for many people. Proper monitoring and early detection of cardiac disease can therefore reduce this burden, yet accurate prediction of heart disease from clinical data remains a significant challenge. This study aims to develop an autonomous system capable of diagnosing heart disease by employing feature extraction and selection techniques. Decision support systems are commonly employed for automated disease diagnosis in humans, and the choice of the most relevant characteristics has a major impact on their performance. This research develops a feature extraction methodology to identify relevant patterns within heart disorder data; the extracted features are subsequently employed to classify various heart conditions. The analysis demonstrates a methodology that can be used to find a lower-dimensional collection of features from test data and utilize those features to diagnose heart disease. The presented methodology utilizes Probabilistic Principal Component Analysis (PPCA), which captures high-impact characteristics in a new projection. With PPCA, the projection vectors with the highest covariance contributions are extracted and used to reduce the feature dimension. The method's performance was evaluated using standard metrics, and PPCA demonstrated superior performance in classifying heart disease compared to other methods in terms of accuracy, specificity, and precision.
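The general workflow of projecting clinical features onto high-variance directions before classification can be sketched as below. Plain PCA is used here as a stand-in for the probabilistic PPCA of the cited study, and the dataset, component count, and classifier are assumptions rather than the authors' setup.

```python
# Sketch: reduce clinical features with a PCA-style projection, then classify.
# Plain PCA is a stand-in for PPCA; data and parameters are illustrative.
from sklearn.datasets import load_breast_cancer  # placeholder clinical data
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Project onto the directions with the largest covariance contributions,
# then classify in the lower-dimensional space.
clf = make_pipeline(StandardScaler(), PCA(n_components=8),
                    LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```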
Article
Objectives Serum protein electrophoresis (SPE) in combination with immunotyping (IMT) is the diagnostic standard for detecting monoclonal proteins (M-proteins). However, interpretation of SPE and IMT is weakly standardized, time-consuming and investigator dependent. Here, we present five machine learning (ML) approaches for automated detection of M-proteins on SPE on an unprecedentedly large and well-curated data set and compare the performance with that of laboratory experts. Methods SPE and IMT were performed in serum samples from 69,722 individuals from Norway. IMT results were used to label the samples as M-protein present (positive, n=4,273) or absent (negative, n=65,449). Four feature-based ML algorithms and one convolutional neural network (CNN) were trained on 68,722 randomly selected SPE patterns to detect M-proteins. Algorithm performance was compared to that of an expert group of clinical pathologists and laboratory technicians (n=10) on a test set of 1,000 samples. Results The random forest classifier showed the best performance (F1-Score 93.2 %, accuracy 99.1 %, sensitivity 89.9 %, specificity 99.8 %, positive predictive value 96.9 %, negative predictive value 99.3 %) and outperformed the experts (F1-Score 61.2 ± 16.0 %, accuracy 89.2 ± 10.2 %, sensitivity 94.3 ± 2.8 %, specificity 88.9 ± 10.9 %, positive predictive value 47.3 ± 16.2 %, negative predictive value 99.5 ± 0.2 %) on the test set. Interestingly, while the performance of the RFC saturated, the CNN performance increased steadily within our training set (n=68,722). Conclusions Feature-based ML systems are capable of automated detection of M-proteins on SPE beyond expert level and show potential for use in the clinical laboratory.
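A minimal sketch of the kind of evaluation reported above, a random forest on a heavily imbalanced two-class problem with F1, sensitivity, specificity, PPV and NPV, is shown below. The synthetic data, class balance, and forest settings are assumptions, not the study's data or configuration.

```python
# Sketch: feature-based classifier on an imbalanced binary task, reporting the
# same metrics as the study; all data and hyperparameters are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20000, n_features=40, weights=[0.94],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

rfc = RandomForestClassifier(n_estimators=300, class_weight="balanced",
                             random_state=0).fit(X_tr, y_tr)
y_pred = rfc.predict(X_te)

tn, fp, fn, tp = confusion_matrix(y_te, y_pred).ravel()
print("F1:", f1_score(y_te, y_pred))
print("sensitivity:", tp / (tp + fn), "specificity:", tn / (tn + fp))
print("PPV:", tp / (tp + fp), "NPV:", tn / (tn + fn))
```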
Article
Full-text available
Tabular data analysis is a critical task in various domains, enabling us to uncover valuable insights from structured datasets. While traditional machine learning methods can be used for feature engineering and dimensionality reduction, they often struggle to capture the intricate relationships and dependencies within real-world datasets. In this paper, we present Multi-representation DeepInsight (MRep-DeepInsight), a novel extension of the DeepInsight method designed to enhance the analysis of tabular data. By generating multiple representations of samples using diverse feature extraction techniques, our approach is able to capture a broader range of features and reveal deeper insights. We demonstrate the effectiveness of MRep-DeepInsight on single-cell datasets, Alzheimer's data, and artificial data, showcasing an improved accuracy over the original DeepInsight approach and machine learning methods like random forest, XGBoost, LightGBM, FT-Transformer and L2-regularized logistic regression. Our results highlight the value of incorporating multiple representations for robust and accurate tabular data analysis. By leveraging the power of diverse representations, MRep-DeepInsight offers a promising new avenue for advancing decision-making and scientific discovery across a wide range of fields.
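As a rough illustration of the multi-representation idea only, the sketch below concatenates two different feature-extraction views of the same tabular samples before classification. It is not the DeepInsight image-conversion pipeline; the dataset, the two views (PCA and kernel PCA), and the classifier are all assumptions.

```python
# Conceptual sketch: combine multiple representations of the same samples.
# This illustrates the general multi-representation idea, not MRep-DeepInsight.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA, KernelPCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=50, random_state=0)
X = StandardScaler().fit_transform(X)

# Two different feature-extraction views of the same samples.
view_linear = PCA(n_components=10).fit_transform(X)
view_kernel = KernelPCA(n_components=10, kernel="rbf").fit_transform(X)

# Stack the views so the learner can draw on both representations.
X_multi = np.hstack([view_linear, view_kernel])
print(cross_val_score(RandomForestClassifier(random_state=0), X_multi, y).mean())
```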
Article
Full-text available
The nature of the input features is one of the key factors indicating what kind of tools, methods, or approaches can be used in a knowledge discovery process. Depending on the characteristics of the available attributes, some techniques could lead to unsatisfactory performance or even may not proceed at all without additional preprocessing steps. The types of variables and their domains affect performance. Any changes to their form can influence it as well, or even enable some learners. On the other hand, the relevance of features for a task constitutes another element with a noticeable impact on data exploration. The importance of attributes can be estimated through the application of mechanisms belonging to the feature selection and reduction area, such as rankings. In the described research framework, the data form was conditioned on relevance by the proposed procedure of gradual discretisation controlled by a ranking of attributes. Supervised and unsupervised discretisation methods were employed to the datasets from the stylometric domain and the task of binary authorship attribution. For the selected classifiers, extensive tests were performed and they indicated many cases of enhanced prediction for partially discretised datasets.
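The idea of conditioning the data form on relevance can be approximated by discretising only the top-ranked attributes, as in the sketch below. The ranking criterion (mutual information), the unsupervised uniform binning, the bin count, and the cut-off are assumptions and do not reproduce the cited stylometric procedure.

```python
# Sketch of ranking-controlled partial discretisation: only the top-ranked
# attributes are discretised; ranking method, bins and cut-off are assumed.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif
from sklearn.preprocessing import KBinsDiscretizer

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Rank attributes by relevance and pick the top half for discretisation.
ranking = np.argsort(mutual_info_classif(X, y, random_state=0))[::-1]
top = ranking[:10]

X_partial = X.copy()
disc = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="uniform")
X_partial[:, top] = disc.fit_transform(X[:, top])
```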
Article
Full-text available
Feature Selection (FS) is an essential research topic in the area of machine learning. FS, the process of identifying the relevant features and removing the irrelevant and redundant ones, is meant to deal with the high-dimensionality problem for the sake of selecting the best-performing feature subset. In the literature, many feature selection techniques approach the task as a search problem, where each state in the search space is a possible feature subset. In this paper, we introduce a new feature selection method based on reinforcement learning. First, decision tree branches are used to traverse the search space. Second, a transition similarity measure is proposed to balance the exploration-exploitation trade-off. Finally, the informative features are identified as those most involved in constructing the best branches. The performance of the proposed approach is evaluated on nine standard benchmark datasets. The results, using the AUC score, show the effectiveness of the proposed system.
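One ingredient of this idea, measuring how involved each feature is in constructing tree branches, can be sketched as below. This is not the reinforcement learning method itself; the data and tree settings are assumptions.

```python
# Sketch: count how often each feature is used as a split in tree branches,
# loosely mirroring "informative features are those most involved in branches".
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=15, n_informative=4,
                           random_state=0)
tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X, y)

# tree_.feature holds the splitting feature per node (negative values mark leaves).
split_features = tree.tree_.feature[tree.tree_.feature >= 0]
counts = np.bincount(split_features, minlength=X.shape[1])
print("features ranked by involvement in branches:", np.argsort(counts)[::-1])
```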
Article
The estimation of Suspended Sediment Load (SSL) is challenging due to its complex nature within the field of hydrology. The selection and reduction of input feature dimensions, along with the non-uniformity of the sediment and estimation accuracy, pose challenges when estimating suspended sediment load. By combining support vector regression models with MOGWO, MOPSO, and MOOTLBO, this study aimed to predict suspended sediment load. Hybrid models are constructed with three key objectives: enhancing accuracy, minimizing feature count, and optimizing sediment classification. In order to accomplish this, two scenarios have been formulated. The first scenario prioritizes performance accuracy, whereas the second scenario assigns equal importance to three objectives. The study focuses on the Kosar Dam watershed in southwest Iran. The CHIRPS precipitation product and the GLDAS soil moisture product are considered to be predictors. The extraction of input features is carried out utilizing Principal Component Analysis (PCA). The results from both scenarios indicated that the optimal sediment classification consists of 10 categories, demonstrating superior performance across all three integrated models. Incorporating feature selection enhances the model's performance and decreases the number of features. In the first scenario, 10 features are selected from a set of 30 input feature vectors, while in the second scenario, 14 features are chosen. Consequently, hybrid models prove to be effective in reducing input features, optimizing classification of SSL, and enhancing prediction accuracy. In general, the SVR-MOOTLBO model exhibits superior performance when compared to other models. The performance indices (R, RMSE, MAE, PBIAS, and MD) exhibit variations ranging from 0.02% to 0.75% between the first and second scenarios in SVR-MOOTLBO, while the RPIQ index shows a relatively modest difference of 6.2%. In both scenarios, the SVR parameters are well-tuned, and the search agents of the MOOTLBO algorithm exhibit effective functionality.
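The basic building block shared by the hybrid models above, PCA-based input reduction feeding a support vector regressor, can be sketched as follows. The synthetic data, component count, and SVR settings are assumptions, and the multi-objective tuning (MOGWO/MOPSO/MOOTLBO) is not reproduced here.

```python
# Sketch: PCA feature extraction followed by SVR, the core of the hybrid models;
# data and hyperparameters are illustrative placeholders.
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X, y = make_regression(n_samples=400, n_features=30, noise=5.0, random_state=0)

model = make_pipeline(StandardScaler(), PCA(n_components=10),
                      SVR(C=10.0, epsilon=0.1, kernel="rbf"))
print("CV R^2:", cross_val_score(model, X, y, scoring="r2").mean())
```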
Article
A brain biomarker for irritable bowel syndrome (IBS) is still lacking. This study aims to explore a new technique for studying the brain alterations of IBS patients based on multi-source brain data. A decision-level fusion method based on gradient boosting decision trees (GBDT) was proposed, and 100 healthy subjects were first used to validate its effectiveness. The identification of brain alterations and the evaluation of pain in IBS patients were then carried out with the fusion method, based on resting-state fMRI and DWI, for 46 patients and 46 controls selected randomly from the 100 healthy subjects. The results showed that the method achieves good classification between IBS patients and controls (accuracy = 95%) and good pain evaluation in IBS patients (mean absolute error = 0.1977). Moreover, both the gain-based and the permutation-based evaluations, rather than statistical analysis, showed that the left cingulum bundle contributed most significantly to the classification, while the right precuneus contributed most significantly to the evaluation of abdominal pain intensity in the IBS patients. These differences suggest a probable but as yet unexplored separation between the central regions involved in the identification of IBS and those involved in its progression. This finding may provide a new perspective and technique for studying brain alterations related to IBS.
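Decision-level fusion of this kind can be sketched as one gradient boosting model per data source whose predicted probabilities are combined by a second-stage learner. The two synthetic "modalities", the split of features, and the logistic regression fusion stage are assumptions standing in for the study's fMRI/DWI inputs and fusion rule.

```python
# Sketch of decision-level fusion: one GBDT per modality, predictions fused.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# One synthetic feature matrix split into two "modalities" standing in for
# fMRI- and DWI-derived features of the same subjects.
X, y = make_classification(n_samples=400, n_features=100, n_informative=10,
                           random_state=0)
X_a, X_b = X[:, :60], X[:, 60:]

idx_tr, idx_te = train_test_split(np.arange(len(y)), stratify=y, random_state=0)

# Train one GBDT per modality, then fuse their decisions with a second stage.
gbdt_a = GradientBoostingClassifier(random_state=0).fit(X_a[idx_tr], y[idx_tr])
gbdt_b = GradientBoostingClassifier(random_state=0).fit(X_b[idx_tr], y[idx_tr])

fuse_tr = np.column_stack([gbdt_a.predict_proba(X_a[idx_tr])[:, 1],
                           gbdt_b.predict_proba(X_b[idx_tr])[:, 1]])
fuse_te = np.column_stack([gbdt_a.predict_proba(X_a[idx_te])[:, 1],
                           gbdt_b.predict_proba(X_b[idx_te])[:, 1]])

fusion = LogisticRegression().fit(fuse_tr, y[idx_tr])
print("fused test accuracy:", fusion.score(fuse_te, y[idx_te]))
```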
Article
A proposal is made in this paper regarding a deep feed-forward neural network for the classification of binary microarray datasets. Eight standard binary-class microarray cancer datasets are used to validate the suggested approach, specifically cancers of the brain, colon, prostate, leukemia, ovary, lung-Harvard2, lung-Michigan, and breast. In addition, six multiclass microarray datasets, namely 3-class Leukemia, 4-class Leukemia, 4-class SRBCT, 3-class MLL, 5-class Lung cancer, and 11-class Tumor, are also considered. To overcome the curse of dimensionality, PCA is used for dimensionality reduction in the case of the binary-class datasets. We craft a fully connected neural network architecture, configuring its parameters with sigmoid activation assigned to the network's input and hidden layers; this includes specifying the number of epochs, the batch sizes, and appropriate activation functions. The suggested method's multiclass behavior is enabled by assigning the SoftMax activation function to the output layer. The min-max approach is used for feature scaling. To compute the error of the method, binary cross-entropy and categorical cross-entropy are used on the binary and multiclass datasets, respectively, and the ADAM optimizer is used for optimization. A study is conducted to compare the suggested approach with the most advanced techniques available. According to experimental findings on these common microarray datasets and comparisons with state-of-the-art techniques, the suggested method's performance is quite respectable.
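The kind of network described above can be sketched for the binary case as follows; the multiclass variant would swap in a SoftMax output with categorical cross-entropy. The synthetic data, layer sizes, epochs, and batch size are illustrative assumptions, not the paper's configuration.

```python
# Sketch of a fully connected feed-forward network on PCA-reduced,
# min-max-scaled data standing in for a binary microarray dataset.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler
from tensorflow import keras

X, y = make_classification(n_samples=200, n_features=500, n_informative=20,
                           random_state=0)
X = MinMaxScaler().fit_transform(PCA(n_components=50).fit_transform(X))

model = keras.Sequential([
    keras.layers.Input(shape=(50,)),
    keras.layers.Dense(32, activation="sigmoid"),   # hidden layer
    keras.layers.Dense(1, activation="sigmoid"),    # binary output
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.fit(X, y, epochs=20, batch_size=16, verbose=0)
print("training accuracy:", model.evaluate(X, y, verbose=0)[1])
```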