Article

UCI Machine Learning Repository

... In this section, we conduct an in-depth experimental evaluation of the newly proposed B-TSK-FC method. The experimental data are sourced from the benchmark datasets of KEEL [42] and UCI [43]. Moreover, we compare the B-TSK-FC method with five other leading class-imbalance classifiers and ensemble classifiers. ...
... To ensure the fairness and comprehensiveness of our study, we selected 15 benchmark datasets from KEEL [42] and UCI [43] to conduct a consistent evaluation of the B-TSK-FC method against the comparative methods. To thoroughly evaluate the classification performance of the proposed B-TSK-FC, these fifteen datasets cover a wide range of dimensionalities, class counts, instance counts and imbalance ratios. ...
... In this subsection, we conducted a statistical test on the performance of eight class-imbalanced learning methods: our proposed B-TSK-FC method and seven advanced existing imbalanced learning methods, namely SMOTE+TSK, W-TSK, SMOTE+KNN, RUSBoost, OverBoost, SMOTEBagging, and SMOTEBoost. The experimental data comprised 15 benchmark imbalanced datasets sourced from KEEL [42] and UCI [43]. To assess the performance of these methods, we conducted the Friedman test. ...
Article
Full-text available
With the expansion of data scale and diversity, the issue of class imbalance has become increasingly salient. The current methods, including oversampling and under-sampling, exhibit limitations in handling complex data, leading to overfitting, loss of critical information, and insufficient interpretability. In response to these challenges, we propose a broad TSK fuzzy classifier with a simplified set of fuzzy rules (B-TSK-FC) that deals with classification tasks with class-imbalanced data. Firstly, we select and optimize fuzzy rules based on their adaptability to different complex data to simplify the fuzzy rules and therefore improve the interpretability of the TSK fuzzy sub-classifiers. Secondly, the fuzzy rules are weighted to protect the information demonstrated by minority classes, thereby improving the classification performance on class-imbalanced datasets. Finally, a novel loss function is designed to derive the weights for each TSK fuzzy sub-classifier. The experimental results on fifteen benchmark datasets demonstrate that B-TSK-FC is superior to the comparative methods from the aspects of classification performance and interpretability in the scenario of class imbalance.
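The Friedman test mentioned in the excerpts above can be reproduced with SciPy in a few lines; the score matrix below is a random placeholder standing in for the accuracies of the compared methods on the 15 benchmark datasets, not results from the cited paper.

```python
# Hypothetical example: Friedman test over per-dataset scores of several classifiers.
import numpy as np
from scipy.stats import friedmanchisquare

# rows = 15 benchmark datasets, columns = 3 illustrative methods (placeholder scores)
rng = np.random.default_rng(0)
scores = rng.uniform(0.6, 0.95, size=(15, 3))

# friedmanchisquare expects one sequence of measurements per method
stat, p_value = friedmanchisquare(scores[:, 0], scores[:, 1], scores[:, 2])
print(f"Friedman statistic = {stat:.3f}, p-value = {p_value:.4f}")
# A small p-value suggests at least one method ranks consistently differently;
# a post-hoc test (e.g., Nemenyi) is then typically applied.
```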
... This study uses a quantitative approach and leverages open-source tools like R-language [16] and WEKA [17] for data analysis. Several benchmark datasets with ordinal values from the UCI machine learning repository [18] are selected. These datasets are then manipulated to create subsets with varying degrees of missing values: 1%, 5%, 10%, 15%, 20%, and 25%. ...
... The summary of the commonly used techniques for treating missing values is presented in Table 1:
Jöreskog [5]: Jöreskog's classification of missing data types
Pantanowitz and Marwala [3]: Factors contributing to missing values in surveys
Huisman [8]: Imputation techniques for improving data analysis
Albright et al. [2]: Challenges posed by missing data in analysis
Tufféry [1]: Impact of missing data on decision-making
Lee et al. [6]: Imputation as a solution for replacing missing data
Eekhout et al. [7]: Uninformed patterns of absent data in real-world datasets
Bache and Lichman [18]: UCI Machine Learning Repository for benchmark datasets
Shah et al. [4]: Critique of discarding data vectors with missing values
Core [16]: R-language as an open-source tool for data analysis
Choudhury and Pal [20]: Artificial Neural Networks for model-based imputation
Pujianto et al. [21]: Evaluation of k-Nearest Neighbors imputation method
Mercaldo and Blume [22]: Missing data imputation using predictive models
Hung et al. [23]: Impact of removing samples with high missing values proportion
Emmanuel et al. [24]: Determining the acceptable threshold for missing values
Austin et al. [25]: Limitations of simple statistical imputation approaches
Ismail et al. [26]: Systematic review of data imputation methods in data mining
Chiu et al. [27]: Data imputation for missing values and ensuring data integrity
Shahzad et al. [28]: Regression techniques for model-based imputation
Lin et al. [29]: Deep Learning-based imputation approaches
Ahn et al. [30]: Comparison of dynamic imputation techniques
Various methods have been developed to handle missing values in numeric data, although their suitability for ordinal datasets may vary [11]. In the context of treating missing values, several commonly used techniques are worth mentioning. ...
... We have selected five benchmark datasets from the UCI machine learning repository [18] that exhibit the presence of ordinal values, as summarized in Table 2. These datasets serve as the foundation for evaluating diverse imputation methods. ...
Preprint
Full-text available
Missing data can significantly impact dataset integrity and suitability, leading to unreliable statistical results, distortions, and poor decisions. The presence of missing values in data introduces inaccuracies in clustering and classification and compromises the reliability and validity of such analyses. This study investigates multiple imputation techniques specifically designed for handling missing values in ordinal data commonly encountered in surveys and questionnaires. Quantitative approaches are used to evaluate different imputation methods on various datasets with varying missing value percentages. By comparing the performance of imputation techniques using clustering metrics and algorithms (e.g., k-means, Partitioning Around Medoids), the study provides valuable insights for selecting appropriate imputation methods for accurate data analysis. Furthermore, the study examines the impact of imputed values on classification algorithms, including k-Nearest Neighbors (kNN), Naive Bayes (NB), and Multilayer Perceptron (MLP). Results demonstrate that the decision tree method is the most effective approach, closely aligning with the original data and achieving high accuracy. In contrast, random number imputation performs poorly, indicating limited reliability. This study advances the understanding of handling missing values and emphasizes the need to address this issue to enhance data analysis integrity and validity.
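As a rough sketch of the protocol described above (masking a given percentage of entries, imputing, and scoring the result), the snippet below injects missing values at several rates and compares mean and kNN imputation by reconstruction RMSE; the Iris data is a stand-in for the ordinal survey datasets used in the study.

```python
# Hypothetical sketch: inject missing values at a given rate, then impute and compare.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer, KNNImputer

X = load_iris().data.copy()          # stand-in for an ordinal survey dataset
rng = np.random.default_rng(42)

def inject_missing(X, rate):
    X_miss = X.copy()
    mask = rng.random(X.shape) < rate
    X_miss[mask] = np.nan
    return X_miss, mask

for rate in (0.01, 0.05, 0.10, 0.25):            # subset of the percentages in the study
    X_miss, mask = inject_missing(X, rate)
    for name, imputer in [("mean", SimpleImputer(strategy="mean")),
                          ("kNN", KNNImputer(n_neighbors=5))]:
        X_imp = imputer.fit_transform(X_miss)
        rmse = np.sqrt(np.mean((X_imp[mask] - X[mask]) ** 2))
        print(f"missing={rate:.0%}  {name:>4} imputation RMSE={rmse:.3f}")
```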
... The meta-features 1-29, shown in Table 6, are multiple views of the data characteristics given in Tables 1, 2, 3 and 4. Similarly, the best-classifier (last column) is the label of one or more best decision tree classifiers from Table 5. The size of the case-base is 100 resolved cases, authored from 100 freely available classification datasets collected from the UCI [29] and OpenML [30] machine learning repositories. A subset of the datasets used for case-base creation is provided in Table 7, along with brief descriptions of their general characteristics. In the proposed Case-Base, all the features are real numbers, so their data types are set to numeric. ...
... For training, the CBR model, i.e., Case-Base, is constructed using 100 multi-class classification datasets, as shown in Table 9. These datasets are sourced from the UCI machine learning repository [29] and OpenML repositories [30]. Similarly, a separate set of 52 datasets is employed for testing the methodology. ...
Article
Full-text available
In practical data mining, a wide range of classification algorithms is employed for prediction tasks. However, selecting the best algorithm poses a challenging task for machine learning practitioners and experts, primarily due to the inherent variability in the characteristics of classification problems, referred to as datasets, and the unpredictable performance of these algorithms. Dataset characteristics are quantified in terms of meta-features, while classifier performance is evaluated using various performance metrics. The assessment of classifiers through empirical methods across multiple classification datasets, while considering multiple performance metrics, presents a computationally expensive and time-consuming obstacle in the pursuit of selecting the optimal algorithm. Furthermore, the scarcity of sufficient training data, denoted by dimensions representing the number of datasets and the feature space described by meta-feature perspectives, adds further complexity to the process of algorithm selection using classical machine learning methods. This research paper presents an integrated framework called eML-CBR that combines edge-ML and case-based reasoning methodologies to accurately address the algorithm selection problem. It adapts a multi-level, multi-view case-based reasoning methodology, considering data from diverse feature dimensions and the algorithms from multiple performance aspects, and distributes computations to both cloud edges and centralized nodes. On the edge, the first-level reasoning employs machine learning methods to recommend a family of classification algorithms, while at the second level, it recommends a list of the top-k algorithms within that family. This list is further refined by an algorithm conflict resolver module. The eML-CBR framework offers a suite of contributions, including integrated algorithm selection, multi-view meta-feature extraction, innovative performance criteria, improved algorithm recommendation, data scarcity mitigation through incremental learning, and an open-source CBR module, reshaping research paradigms. The CBR module, trained on 100 datasets and tested with 52 datasets using 9 decision tree algorithms, achieved an accuracy of 94% for correct classifier recommendations within the top k=3 algorithms, making it highly suitable for practical classification applications.
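As a rough illustration of the case-based reasoning retrieval step described above, the sketch below encodes a dataset by a few generic meta-features and recommends the classifier stored with the most similar case; the meta-features and the toy case base are invented placeholders, not the framework's 29 multi-view meta-features.

```python
# Hypothetical sketch of CBR-style retrieval: describe a dataset by simple meta-features,
# then recommend the classifier attached to the most similar stored case.
import numpy as np
from sklearn.datasets import load_iris

def meta_features(X, y):
    # a few generic meta-features (placeholders for the paper's multi-view features)
    return np.array([X.shape[0], X.shape[1], len(np.unique(y)), X.std()])

# toy case base: (meta-feature vector, best classifier label) -- invented values
case_base = [
    (np.array([150, 4, 3, 1.9]), "C4.5"),
    (np.array([768, 8, 2, 0.9]), "CART"),
    (np.array([5000, 40, 10, 2.5]), "RandomForest"),
]

def recommend(X, y):
    q = meta_features(X, y)
    dists = [np.linalg.norm(q - c) for c, _ in case_base]
    return case_base[int(np.argmin(dists))][1]

X, y = load_iris(return_X_y=True)
print(recommend(X, y))   # -> "C4.5" for this toy case base
```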
... This experiment is conducted using the WEKA software [11] on a computer system with 4 GB RAM, an Intel(R) Core(TM)2 CPU 1.73 GHz processor, and a Windows 7 64-bit operating system. For this experiment, the diabetes medical dataset (Pima Indians diabetes dataset) has been collected from the University of California, Irvine (UCI) machine learning repository [11]. A sample view of the dataset is illustrated in Fig. 2. ...
Preprint
Full-text available
Diabetes has become increasingly prevalent in people of all age groups, posing a significant health challenge. The rising number of individuals affected by diabetes can be attributed to various factors, including bacterial or viral infections, consumption of food contaminated with toxic substances, autoimmune reactions, obesity, poor dietary choices, changes in lifestyle, and environmental pollution. Consequently, early and accurate diagnosis of diabetes is crucial to safeguard lives. Data analytics plays a pivotal role in uncovering hidden patterns within vast datasets, enabling informed conclusions. Within the healthcare sector, machine learning algorithms are employed to analyze medical data and construct models that facilitate medical diagnoses. This paper introduces a diabetes prediction system designed to diagnose diabetes accurately and effectively. Moreover, this paper also explores the approaches to improve the accuracy in diabetes prediction using medical data with various machine learning algorithms and methods.
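A minimal scikit-learn sketch of the kind of experiment the excerpts describe (classification on the Pima Indians diabetes data); the CSV path and column layout are assumptions, and the classifier choice is illustrative rather than the paper's.

```python
# Hypothetical sketch: train and evaluate a classifier on the Pima Indians Diabetes data.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# assumes a local CSV where the last column is the 0/1 diabetes outcome (hypothetical path)
df = pd.read_csv("pima_indians_diabetes.csv")
X, y = df.iloc[:, :-1], df.iloc[:, -1]

clf = RandomForestClassifier(n_estimators=200, random_state=0)
scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
print(f"10-fold CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```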
... Machine learning (ML) tools have been increasingly utilized in high-stakes tasks such as credit assessments [26] and crime predictions [22]. Despite their success, the data-driven nature of existing machine learning methods makes them easily inherit the biases buried in the training data, and thus results in predictions that discriminate against some sensitive groups [33]. ...
... In this subsection, we introduce the datasets used in our experiments. To evaluate the performance of FEAST on fair few-shot learning, we conduct experiments on three prevalent real-world datasets: Adult [15], Crime [22], and Bank [26]. The detailed dataset statistics are provided in Table 1. ...
Chapter
Full-text available
Recently, there has been a growing interest in developing machine learning (ML) models that can promote fairness, i.e., eliminating biased predictions towards certain populations (e.g., individuals from a specific demographic group). Most existing works learn such models based on well-designed fairness constraints in optimization. Nevertheless, in many practical ML tasks, only very few labeled data samples can be collected, which can lead to inferior fairness performance. This is because existing fairness constraints are designed to restrict the prediction disparity among different sensitive groups, but with few samples, it becomes difficult to accurately measure the disparity, thus rendering ineffective fairness optimization. In this paper, we define the fairness-aware learning task with limited training samples as the fair few-shot learning problem. To deal with this problem, we devise a novel framework that accumulates fairness-aware knowledge across different meta-training tasks and then generalizes the learned knowledge to meta-test tasks. To compensate for insufficient training samples, we propose an essential strategy to select and leverage an auxiliary set for each meta-test task. These auxiliary sets contain several labeled training samples that can enhance the model performance regarding fairness in meta-test tasks, thereby allowing for the transfer of learned useful fairness-oriented knowledge to meta-test tasks. Furthermore, we conduct extensive experiments on three real-world datasets to validate the superiority of our framework against the state-of-the-art baselines.
... In this section, we apply all the studied imputation methods to a real dataset, the Wine dataset. The dataset was extracted from the UCI Machine Learning Repository (Lichman, 2013). To form the relationship between Nonflavanoid phenols, Color intensity and OD280/OD315 of diluted wines, multiple regression analysis was used. ...
Article
The purpose of this research is to compare the efficiency of different imputation methods for multiple regression analysis of heteroscedastic data with missing at random dependent variable. The missing data imputation methods used in this study are mean imputation, hot deck imputation, k-nearest neighbors imputation (KNN), stochastic regression imputation, along with three proposed composite methods, namely hot deck and KNN imputation with equivalent weight (HKEW), hot deck and stochastic regression imputation with equivalent weight (HSEW), and mean and stochastic regression imputation with equivalent weight (MSEW). The comparison between the seven methods was conducted through the simulation study varied by the sample sizes and the missing percentages. The criteria for comparing the efficiency of estimators are bias and mean squared error (MSE). The results show that the stochastic regression imputation performed well in terms of bias in all situations. In terms of MSE, the mean imputation performed well when the sample size is small to medium, whereas the MSEW imputation performed well when the sample size is large and the missing percentage is high (30-40%).
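To make the comparison concrete, here is a rough sketch of two of the studied imputers, mean imputation and stochastic regression imputation, applied to a response that is missing at random under heteroscedastic errors and judged by coefficient bias; the data-generating process and the 30% missing rate are invented for illustration and do not reproduce the paper's simulation design.

```python
# Hypothetical simulation sketch: mean vs stochastic regression imputation of a response
# missing at random, judged by the bias of the resulting regression coefficients.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n, true_beta = 500, np.array([2.0, -1.0])
X = rng.normal(size=(n, 2))
y = X @ true_beta + rng.normal(scale=1 + np.abs(X[:, 0]), size=n)   # heteroscedastic errors

missing = rng.random(n) < 0.3                                       # ~30% of y missing at random
obs = ~missing

def mean_impute(y, obs):
    y_imp = y.copy()
    y_imp[~obs] = y[obs].mean()
    return y_imp

def stochastic_regression_impute(X, y, obs, rng):
    model = LinearRegression().fit(X[obs], y[obs])
    resid_sd = np.std(y[obs] - model.predict(X[obs]))
    y_imp = y.copy()
    # predicted value plus random residual noise, as in stochastic regression imputation
    y_imp[~obs] = model.predict(X[~obs]) + rng.normal(scale=resid_sd, size=(~obs).sum())
    return y_imp

for name, y_imp in [("mean", mean_impute(y, obs)),
                    ("stochastic regression", stochastic_regression_impute(X, y, obs, rng))]:
    beta_hat = LinearRegression().fit(X, y_imp).coef_
    print(f"{name:>22}: bias = {np.round(beta_hat - true_beta, 3)}")
```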
... However, these methods may not always be sufficient to diagnose heart disease accurately, especially in the early stages [4]. In recent years, data mining and machine learning techniques have been used to analyze large amounts of data to help healthcare professionals predict and diagnose heart disease more effectively [5]. ...
Article
Full-text available
Heart disease remains a predominant health challenge, being the leading cause of death worldwide. According to the World Health Organization (WHO), cardiovascular diseases (CVDs) take an estimated 17.9 million lives each year, accounting for 32% of all global deaths. Thus, there is a global health concern necessitating accurate prediction models for timely intervention. Several data mining techniques are used by researchers to help healthcare professionals predict heart disease. However, traditional machine learning models for predicting heart disease often struggle with handling imbalanced datasets. Moreover, when prediction is based on complex data like ECG, extracting and selecting the most pertinent features that accurately represent the underlying pathophysiological conditions without succumbing to overfitting is also a challenge. In this paper, a continuous wavelet transformation and convolutional neural network-based hybrid model abbreviated as WT-CNN is proposed. The key phases of WT-CNN are ECG data collection, preprocessing, RUSBoost-based data balancing, CWT-based feature extraction, and CNN-based final prediction. Through extensive experimentation and evaluation, the proposed model achieves an exceptional accuracy of 97.2% in predicting heart disease. The experimental results show that the approach improves classification accuracy compared to other classification approaches and that the presented model can be successfully used by healthcare professionals for predicting heart disease. Furthermore, this work can have a potential impact on improving heart disease prediction and ultimately enhancing patient lifestyle.
... We also examined 55 real datasets to make significant findings about the impact of feature selection. There were 38 datasets with at least nine features downloaded from the UCI repository (Bache & Lichman, 2013), as well as 17 microarray datasets chosen for their high dimensionality (Morán-Fernández et al., 2017; Remeseiro & Bolón-Canedo, 2019). Tables 3 and 4 depict key properties of the datasets used in this investigation, such as sample size and the number of features and classes. ...
Article
Full-text available
The growth of Big Data has resulted in an overwhelming increase in the volume of data available, including the number of features. Feature selection, the process of selecting relevant features and discarding irrelevant ones, has been successfully used to reduce the dimensionality of datasets. However, with numerous feature selection approaches in the literature, determining the best strategy for a specific problem is not straightforward. In this study, we compare the performance of various feature selection approaches to a random selection to identify the most effective strategy for a given type of problem. We use a large number of datasets to cover a broad range of real-world challenges. We evaluate the performance of seven popular feature selection approaches and five classifiers. Our findings show that feature selection is a valuable tool in machine learning and that correlation-based feature selection is the most effective strategy regardless of the scenario. Additionally, we found that using improper thresholds with ranker approaches produces results as poor as randomly selecting a subset of features.
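As a hedged sketch of the correlation-based strategy the study finds most effective, the snippet below ranks features by absolute correlation with the class label and keeps the top k; this is a simplification of correlation-based feature selection (CFS), which also penalizes redundancy among the selected features.

```python
# Hypothetical sketch: rank features by |correlation with the label| and keep the top k.
import numpy as np
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)

def corr_rank(X, y, k=10):
    corrs = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
    return np.argsort(corrs)[::-1][:k]          # indices of the k most label-correlated features

selected = corr_rank(X, y, k=10)
print("selected feature indices:", selected)
```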
... For the comparative study, the state-of-the-art BNCs described below are run on 44 benchmark datasets from the UCI machine learning repository [44]. Table 2 provides detailed characteristics of these datasets. ...
Article
Full-text available
Bayesian network classifier (BNC) allows efficient and effective inference under condition of uncertainty for classification, and it depicts the interdependencies among random variables using directed acyclic graph (DAG). However, learning an optimal BNC is NP-hard, and complicated DAGs may lead to biased estimates of multivariate probability distributions and subsequent degradation in classification performance. In this study, we suggest using the entropy function as the scoring metric, and then apply greedy search strategy to improve the fitness of learned DAG to training data at each iteration. The proposed algorithm, called One\(+\) Bayesian Classifier (O\(^{+}\)BC), can represent high-dependence relationships in its robust DAG with a limited number of directed edges. We compare the performance of O\(^{+}\)BC with other six state-of-the-art single and ensemble BNCs. The experimental results reveal that O\(^{+}\)BC demonstrates competitive or superior performance in terms of zero-one loss, bias-variance decomposition, Friedman and Nemenyi tests.
... In Hu et al. (2008), the authors applied the AdaBoost algorithm on the KDD99 dataset and achieved better accuracy with fewer false alarms. In Nader et al. (2014), the authors used one-class SVM with kernel PCA to detect attacks in the Gas Pipeline testbed and water treatment plant (Lichman et al. 2013). Further, different studies also use reconstruction-based deep learning methods (Feng et al. 2017;Goh et al. 2017;Taormina and Galelli 2018). ...
Article
Full-text available
Due to their importance to a nation's economy, Critical Infrastructures (CIs) have been lucrative targets for cyber attackers. These critical infrastructures are usually Cyber-Physical Systems such as power grids, water and sewage treatment facilities, oil and gas pipelines, etc. In recent times, these systems have suffered numerous cyber attacks. Researchers have been developing cyber security solutions for CIs to avoid lasting damage. According to standard frameworks, cyber security based on identification, protection, detection, response, and recovery is at the core of this research. Detection of an ongoing attack that escapes standard protection such as firewalls, anti-virus, and host/network intrusion detection has gained importance, as such attacks eventually affect the physical dynamics of the system. Therefore, anomaly detection in physical dynamics proves an effective means to implement defense-in-depth. PASAD is one example of anomaly detection in sensor/actuator data, which represents such systems' physical dynamics. We present EPASAD, which improves the detection technique used in PASAD to detect micro-stealthy attacks that, as our experiments show, PASAD's spherical boundary-based detection fails to detect. Our method EPASAD overcomes this by using ellipsoid boundaries, thereby tightening the boundaries in various dimensions, whereas a spherical boundary treats all dimensions equally. We validate EPASAD using the dataset produced by the TE-process simulator and the C-town datasets. The results show that EPASAD improves PASAD's average recall by 5.8% and 9.5% for the two datasets, respectively.
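The core idea of moving from a spherical to an ellipsoidal boundary can be pictured with Euclidean versus Mahalanobis distances to a training centroid; this is a generic stand-in for illustration, not the PASAD/EPASAD signal-subspace procedure.

```python
# Hypothetical sketch: spherical (Euclidean) vs ellipsoidal (Mahalanobis) anomaly scores.
import numpy as np

rng = np.random.default_rng(0)
# elongated "normal behaviour" cloud: large variance along axis 0, small along axis 1
train = rng.multivariate_normal([0, 0], [[3.0, 0.0], [0.0, 0.2]], size=500)
mu = train.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(train, rowvar=False))

def euclidean_score(x):            # spherical boundary: same radius in every direction
    return float(np.linalg.norm(x - mu))

def mahalanobis_score(x):          # ellipsoidal boundary: tight along low-variance axes
    d = x - mu
    return float(np.sqrt(d @ cov_inv @ d))

x_stealthy = np.array([0.0, 1.5])  # small shift along the low-variance dimension
print(euclidean_score(x_stealthy), mahalanobis_score(x_stealthy))
# The Mahalanobis score flags the point even though its Euclidean distance is modest.
```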
... In order to validate the efficiency of the proposed MDC algorithm, we conducted dataset clustering experiments not only on synthetic datasets, but also on real data, with clusters of various densities, shapes and sizes [25][26], while comparing the results with those obtained by DBSCAN and DPC. The algorithms are implemented in the Integrated Development Environment of Python (Python IDE version 3.7.5). ...
... Table 1 shows the information on them. The first four data sets originate from Bache and Lichman (2013), and the remaining ones from Olszewski (2001). The discrete functional samples in each data set are of different lengths (see Table 1). ...
Article
Full-text available
In this paper, the scale response functional multivariate regression model is considered. By using the basis functions representation of functional predictors and regression coefficients, this model is rewritten as a multivariate regression model. This representation of the functional multivariate regression model is used for multiclass classification for multivariate functional data. Computational experiments performed on real labelled data sets demonstrate the effectiveness of the proposed method for classification for functional data.
... However, classes also separate along x1, albeit less perfectly. Unrestricted GMLVQ with one prototype per class realizes near perfect classification in terms of BAC. Wisconsin Diagnostic Breast Cancer data: This benchmark data set from the UCI Machine Learning Repository [10] contains 569 samples with 30 features extracted from cells in an image of a fine needle aspirate of a breast mass (357 benign, 212 malignant). For illustration purposes, we train a GMLVQ system using 25% randomly sampled training data, and use the remaining 75% as a test set. ...
Conference Paper
Full-text available
We introduce and investigate the iterated application of Generalized Matrix Relevance Learning for the analysis of feature relevances in classification problems. The suggested Iterated Relevance Matrix Analysis (IRMA) identifies a linear subspace representing the classification-specific information of the considered data sets in feature space using Generalized Matrix Learning Vector Quantization. By iteratively determining a new discriminative direction while projecting out all previously identified ones, all features carrying relevant information about the classification can be found, facilitating a detailed analysis of feature relevances. Moreover, IRMA can be used to generate improved low-dimensional representations and visualizations of labeled data sets.
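A minimal sketch of the "determine a direction, then project it out" loop that IRMA iterates; here LDA is used as a hedged stand-in for the leading eigendirection of the GMLVQ relevance matrix, which the actual method computes.

```python
# Hypothetical sketch of the deflation step: find a discriminative direction,
# then project the data onto its orthogonal complement before the next iteration.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_breast_cancer(return_X_y=True)
X = (X - X.mean(axis=0)) / X.std(axis=0)

directions = []
for _ in range(3):
    lda = LinearDiscriminantAnalysis(n_components=1).fit(X, y)
    w = lda.scalings_[:, 0]
    w = w / np.linalg.norm(w)
    directions.append(w)
    X = X - np.outer(X @ w, w)      # remove the component along w from every sample

print(np.round(np.array(directions) @ np.array(directions).T, 3))  # near-orthogonal directions
```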
... The proposed method and the other models are applied to nine benchmark datasets from the UCI repository for binary classification [39]. Table 1 shows basic information about the standard datasets chosen in this section, including the number of variables, the sample size, and the imbalance ratio of each dataset. ...
Article
Full-text available
A new second-order cone programming (SOCP) formulation inspired by the soft-margin linear programming support vector machine (LP-SVM) formulation and the cost-sensitive framework is proposed. Our proposed method maximizes the slack variables related to each class by appropriately relaxing the bounds on the VC dimension using the l\(_{\infty }\)-norm, and penalizes them using the corresponding regularization parametrization to control the trade-off between margin and slack variables. The proposed method has two main advantages: firstly, a flexible classifier is constructed that extends the advantages of the soft-margin LP-SVM problem to the second-order cone; secondly, due to the elimination of a conic restriction, only two SOCP problems containing second-order cone constraints need to be solved. Thus, similar results to the SOCP-SVM problem are obtained with less computational effort. Numerical experiments show that our method achieves better classification performance than the conventional SOCP-SVM formulations and standard linear SVM formulations.
... Axis-parallel subspaces are regarded as a specific case of two-way clustering, also known as co-clustering or biclustering. These techniques cluster the data objects and features simultaneously, operating on a data matrix whose rows are the data objects and whose columns are the features [11]. They typically do not work with arbitrary feature combinations, as is the case with subspace approaches in general. ...
Article
Full-text available
Multidimensional data is more prevalent due to the rapid expansion of computational biometric and e-commerce applications. As a result, mining multidimensional data is a crucial issue with significant practical implications. The curse of dimensionality and, more importantly, the meaning of the similarity measure in high-dimensional space are two specific difficulties that arise while mining high-dimensional data. The challenges and methods for dimensionality reduction of multidimensional data are surveyed in this work.
... We employed several classifiers, whose performance was analyzed on seven benchmarks that are consolidated and widely utilized in the literature, such as Parkinson, Pima, and Ionosphere, provided by the UCI Machine Learning Repository [70], and Haberman, Monk2, Appendicitis, and Sonar, provided by KEEL (Knowledge Extraction based on Evolutionary Learning) [71]. Information about the input features and the number of samples is presented in Table 2. ...
Preprint
Full-text available
In railway operations, there are several factors that must be analyzed, such as operation cost, maintenance stops, failures, and others. One of these important topics is the analysis of hot box and hot wheel failures, since the failure of these components can compromise the entire operation, resulting in serious accidents such as train derailments. Thus, a method that is able to classify a failure is essential for accident prevention. Since hot box and hot wheel failure detection is a binary classification problem on nonlinear data, this work proposes a new method based on the Multilayer Perceptron combined with Set-Membership. The Multilayer Perceptron is very flexible and can generally be used to learn complex problems with the aforementioned characteristics; meanwhile, the Set-Membership leads to reduced computational complexity, fast convergence, and high accuracy. To validate the performance, we compare twelve models applied to eight datasets, seven of which are benchmarks, and one composed of hot box and hot wheel problems. The results showed that the methods had good performance when applied to these problems.
... Moving on to real-world datasets, we analyzed five datasets from the UCI repository [15]: Breast Cancer Wisconsin (BCW), Seed, Ionosphere (ION), Iris, and Wine. The specifications of these datasets are presented in Table 4. Notably, it's recognized that clustering the iris data can result in either 2 or 3 partitions, as indicated in [11]. ...
Preprint
Full-text available
The optimal number of clusters is one of the main concerns when applying cluster analysis. Several cluster validity indexes have been introduced to address this problem. However, in some situations, there is more than one option that can be chosen as the final number of clusters. This aspect has been overlooked by most of the existing works in this area. In this study, we introduce a correlation-based fuzzy cluster validity index known as the Wiroonsri-Preedasawakul (WP) index. This index is defined based on the correlation between the actual distance between a pair of data points and the distance between adjusted centroids with respect to that pair. We evaluate and compare the performance of our index with several existing indexes, including Xie-Beni, Pakhira-Bandyopadhyay-Maulik, Tang, Wu-Li, generalized C, and Kwon2. We conduct this evaluation on four types of datasets: artificial datasets, real-world datasets, simulated datasets with ranks, and image datasets, using the fuzzy c-means algorithm. Overall, the WP index outperforms most, if not all, of these indexes in terms of accurately detecting the optimal number of clusters and providing accurate secondary options. Moreover, our index remains effective even when the fuzziness parameter $m$ is set to a large value. Our R package called WPfuzzyCVIs used in this work is also available in https://CRAN.R-project.org/package=UniversalCVI.
... The Aggregation dataset represents a cluster-connected dataset, Compound consists of clustered datasets with uneven cluster density, R15 represents clustered datasets with uniform but unconnected density, Spiral represents a bar-like dataset with uniform density, and P2Glob contains two different shapes with uniform density. Among the six real datasets, the Sym dataset was obtained from the data mining software Waikato Environment for Knowledge Analysis (WEKA), while the remaining datasets were sourced from the UCI machine learning repository [40]. Detailed descriptions of the datasets and their clustering effects can be found in Sections B and D of part IV. ...
Article
Full-text available
Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is a classic density-based clustering method that can identify clusters of arbitrary shapes in noisy datasets. However, DBSCAN requires two input parameters: the neighborhood distance value (Eps) and the minimum number of sample points in its neighborhood (MinPts), to perform clustering on a dataset. The quality of clustering is highly sensitive to these two parameters. To tackle this issue, this paper introduces a parameter-adaptive DBSCAN clustering algorithm based on the Whale Optimization Algorithm (WOA-DBSCAN). The algorithm determines the parameter range based on the dataset distribution and utilizes the silhouette coefficient as the objective function. It iteratively selects the two input parameters of DBSCAN within the parameter range using the WOA. This approach ultimately achieves adaptive clustering of DBSCAN. Experimental results on five typical artificial datasets and six real UCI datasets demonstrate the effectiveness of the proposed WOA-DBSCAN algorithm. Compared with DBSCAN and its related optimization algorithms, WOA-DBSCAN shows significant improvements. The F-values of WOA-DBSCAN increased by 9.8%, 13.2%, and 2% respectively in two-dimensional artificial datasets. Additionally, the accuracy values on low to medium dimensional real datasets increased by 22.3%, 10%, and 23.3%. Hence, WOA-DBSCAN can maintain the clustering ability of DBSCAN while achieving adaptive parameter clustering.
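As a hedged illustration of selecting DBSCAN's (Eps, MinPts) by maximizing the silhouette coefficient, the sketch below uses a plain grid search in place of the paper's Whale Optimization Algorithm.

```python
# Hypothetical sketch: pick DBSCAN parameters by maximizing the silhouette coefficient.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

X = load_iris().data
best = (None, -1.0)
for eps in np.linspace(0.2, 1.5, 14):
    for min_pts in range(3, 10):
        labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(X)
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        if n_clusters < 2:
            continue                                   # silhouette needs at least 2 clusters
        score = silhouette_score(X, labels)
        if score > best[1]:
            best = ((round(eps, 2), min_pts), score)
print("best (eps, min_pts):", best[0], "silhouette:", round(best[1], 3))
```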
... Model name: Google Colab. Datasets and baselines: We perform our experiments on "Bag-of-Words" representations of text documents [23]. We use the following datasets: NYTimes news articles (number of points = 300000, dimension = 102660), Enron emails (number of points = 39861, dimension= 28102), and KOS blog entries (number of points = 3430, dimension = 6960). ...
Preprint
In their seminal work, Broder \textit{et. al.}~\citep{BroderCFM98} introduces the $\mathrm{minHash}$ algorithm that computes a low-dimensional sketch of high-dimensional binary data that closely approximates pairwise Jaccard similarity. Since its invention, $\mathrm{minHash}$ has been commonly used by practitioners in various big data applications. Further, the data is dynamic in many real-life scenarios, and their feature sets evolve over time. We consider the case when features are dynamically inserted and deleted in the dataset. We note that a naive solution to this problem is to repeatedly recompute $\mathrm{minHash}$ with respect to the updated dimension. However, this is an expensive task as it requires generating fresh random permutations. To the best of our knowledge, no systematic study of $\mathrm{minHash}$ is recorded in the context of dynamic insertion and deletion of features. In this work, we initiate this study and suggest algorithms that make the $\mathrm{minHash}$ sketches adaptable to the dynamic insertion and deletion of features. We show a rigorous theoretical analysis of our algorithms and complement it with extensive experiments on several real-world datasets. Empirically we observe a significant speed-up in the running time while simultaneously offering comparable performance with respect to running $\mathrm{minHash}$ from scratch. Our proposal is efficient, accurate, and easy to implement in practice.
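For reference, a minimal static minHash sketch of the kind the preprint builds on, using random linear hash functions over the indices of the non-zero features; the dynamic insertion/deletion machinery that is the paper's contribution is not shown.

```python
# Hypothetical sketch: classic minHash signatures for binary (set-valued) data.
import random

P = 2_147_483_647                                       # a large prime modulus

def make_hashes(num_hashes, seed=0):
    rng = random.Random(seed)
    return [(rng.randrange(1, P), rng.randrange(0, P)) for _ in range(num_hashes)]

def minhash(features, hashes):
    # features: indices of the non-zero dimensions of a binary vector
    return [min((a * f + b) % P for f in features) for a, b in hashes]

def estimated_jaccard(sig1, sig2):
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)

hashes = make_hashes(num_hashes=128)
s1, s2 = {1, 5, 9, 42, 77}, {1, 5, 9, 42, 100}
print(estimated_jaccard(minhash(s1, hashes), minhash(s2, hashes)))  # roughly 4/6, the true Jaccard
```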
... The coordinates of the variable value for the observation are calculated as shown in Equation (2). To exemplify the previous two steps, a numerical example is used based on the Iris data, a well-known benchmark dataset from the UCI machine learning repository (Bache & Lichman, 2013). ...
Thesis
Full-text available
Analysis of industrial data imposes several challenges. This type of data is collected from heterogeneous sources and stored in isolated silos. This hinders the accessibility of this asset and its proper exploitation in the decision-making process. Therefore, there is an urgent and prioritized need to merge these disconnected silos to maximize the benefits from this asset. However, merging data from multiple sources has two main challenges: efficient data representation for better data fusion, and selecting the most appropriate modeling technique based on the type and quantity of data as well as the nature of the problem to be solved. Solving these challenges will help complex industrial systems maintain energy-efficient operations, reduce their environmental footprint, and thus achieve operational excellence. This thesis makes three contributions as a step toward solving the problems of data fusion resulting from the operation of industrial systems. The objective of the three proposed contributions is to obtain an efficient and useful data representation that helps maximize the global value of the available industrial data to facilitate the fusion process at different semantic levels (raw, information and knowledge levels). Those contributions make use of the performance of deep learning (DL) in predictive and generative modeling. They allow for building accurate and robust models to be used for several purposes in industry. These models can accurately diagnose various industrial systems, predict their key performance indicators (KPIs) and capture the true distribution of highly non-linear complex industrial systems to provide the expert with valuable knowledge to prescribe the right actions at the right time. All the proposed methods in this doctoral research are validated using challenging datasets collected from different complex equipment in process industries: a reboiler system in a thermomechanical pulp mill (TMP), a black liquor recovery boiler (BLRB) and a concentrator of black liquor in a Kraft pulp mill. The methods were compared to different machine learning and deep learning techniques extensively used in the literature. Our results indicate that our methods outperform the comparable methods in terms of prediction accuracy. The results obtained show that our proposed approach leads to accurate predictions and successfully captures the non-linearity and dynamic behavior of complex industrial systems with ill-defined distributions. This doctoral research opens the door to industrial heterogeneous data fusion, including sensory data, images, and videos.
... Datasets and methods: To assess the performance of the adversarially de-biased classifier and AdOpt, we consider several benchmark datasets: the UCI datasets Census Income ("Adult") and Default of Credit Card Clients (Lichman et al. 2013), as well as the anonymized home equity line of credit (HELOC) dataset (FICO Explainable Machine Learning Challenge) and MNIST converted to the binary reward format of the BLP by defining the positive class to be images of the digit 5. "Adult" contains demographic information useful for fairness evaluation, and was used in (Pacchiano et al. 2021) alongside MNIST, while FICO and "Credit" are publicly available real-life datasets used for financial credibility evaluation. ...
Preprint
In many real world settings binary classification decisions are made based on limited data in near real-time, e.g. when assessing a loan application. We focus on a class of these problems that share a common feature: the true label is only observed when a data point is assigned a positive label by the principal, e.g. we only find out whether an applicant defaults if we accepted their loan application. As a consequence, the false rejections become self-reinforcing and cause the labelled training set, that is being continuously updated by the model decisions, to accumulate bias. Prior work mitigates this effect by injecting optimism into the model, however this comes at the cost of increased false acceptance rate. We introduce adversarial optimism (AdOpt) to directly address bias in the training set using adversarial domain adaptation. The goal of AdOpt is to learn an unbiased but informative representation of past data, by reducing the distributional shift between the set of accepted data points and all data points seen thus far. AdOpt significantly exceeds state-of-the-art performance on a set of challenging benchmark problems. Our experiments also provide initial evidence that the introduction of adversarial domain adaptation improves fairness in this setting.
Article
Full-text available
Clustering ensemble can be regarded as a mathematical optimization problem, and the genetic algorithm has been widely used as a powerful tool for solving such optimization problems. However, the existing research on clustering ensemble based on the genetic algorithm model has mainly focused on unsupervised approaches and has been limited by parameters like crossover probability and mutation probability. This paper presents a semi-supervised clustering ensemble based on the genetic algorithm model. This approach utilizes pairwise constraint information to strengthen the crossover process and mutation process, resulting in enhanced overall algorithm performance. To validate the effectiveness of the proposed approach, extensive comparative experiments were conducted on 9 diverse datasets. The results of the experiments demonstrate the superiority of the proposed algorithm in terms of clustering accuracy and robustness. In summary, this paper introduces a novel semi-supervised approach based on the genetic algorithm model. The utilization of pair-wise constraint information enhances the algorithm’s performance, making it a promising solution for real-world clustering problems.
Article
Constrained clustering, such as $k$-means with instance-level Must-Link (ML) and Cannot-Link (CL) auxiliary information as the constraints, has been extensively studied recently, due to its broad applications in data science and AI. Despite some heuristic approaches, there has not been any algorithm providing a non-trivial approximation ratio to the constrained $k$-means problem. To address this issue, we propose an algorithm with a provable approximation ratio of $O(\log k)$ when only ML constraints are considered. We also empirically evaluate the performance of our algorithm on real-world datasets having artificial ML and disjoint CL constraints. The experimental results show that our algorithm outperforms the existing greedy-based heuristic methods in clustering accuracy.
Chapter
Full-text available
This paper presents the development of an artificial neural network (ANN) for the prediction of heart disease, along with a comprehensive data strategy aimed at improving the adoption of artificial intelligence (AI) in healthcare. The neural network architecture is carefully designed according to the dimensions of the data, transfer learning methods are used to increase generalizability, and hyperparameters are optimized to achieve high predictive accuracy. To address the challenges related to AI adoption in healthcare, a robust data strategy is devised, focusing on data quality, privacy, security, and regulatory compliance. The strategy incorporates comprehensive data governance frameworks, secure data sharing protocols, and privacy-preserving techniques to facilitate the responsible and ethical utilization of sensitive medical information. Furthermore, strategies for ensuring interoperability and scalability of AI systems within existing healthcare infrastructure are explored.
Article
Full-text available
Most of the dimensionality reduction algorithms assume that data are independent and identically distributed (i.i.d.). In real-world applications, however, sometimes there exist relationships between data. Some relational learning methods have been proposed, but those with discriminative relationship analysis are lacking yet, as important supervisory information is usually ignored. In this paper, we propose a novel and general framework, called relational Fisher analysis (RFA), which successfully integrates relational information into the dimensionality reduction model. For nonlinear data representation learning, we adopt the kernel trick to RFA and propose the kernelized RFA (KRFA). In addition, the convergence of the RFA optimization algorithm is proved theoretically. By leveraging suitable strategies to construct the relational matrix, we conduct extensive experiments to demonstrate the superiority of our RFA and KRFA methods over related approaches.
Article
A recent trend of fair machine learning is to build a decision model subject to causality-based fairness requirements, which concern the causality between sensitive attributes and decisions. Almost all (if not all) solutions focus on a single fair decision model and assume no hidden confounders, modeling causal effects in an oversimplified way. However, multiple interdependent decision models are actually used, and discrimination may transmit among them. The hidden confounder is another inescapable fact, and causal effects cannot be computed from observational data in the unidentifiable situation. To address these problems, we propose a method called CMFL (Causality-based Multiple Fairness Learning). CMFL parameterizes the causal model by response-function variables, whose distributions capture the randomness of causal models. CMFL treats each classifier as a soft intervention to infer the post-intervention distribution, and combines the fairness constraints with the classification loss to train multiple decision classifiers. In this way, all classifiers can make approximately fair decisions. Experiments on synthetic and benchmark datasets confirm its effectiveness; the response-function variables can deal with the unidentifiable issue and hidden confounders.
Article
Based on the memoryless BFGS (Broyden–Fletcher–Goldfarb–Shanno) updating formula of a recent well-structured diagonal approximation of the Hessian, we propose an improved proximal method for solving the minimization problem of nonsmooth composite functions. More exactly, a diagonally scaled matrix is iteratively used to approximate Hessian of the smooth ingredient of the cost function, which leads to straightly determining the search directions in each iteration. Afterward, in light of the Zhang–Hager nonmonotone scheme, a nonmonotone technique for performing the line search for the unconstrained optimization models with composite cost functions is devised. What is more, we address convergence of the suggested proximal algorithm. We close the discussion by empirically studying performance of the proposed algorithm on some large-scale compressive sensing and sparse logistic regression problems.
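For the composite objective min_x f(x) + lam*||x||_1 that such proximal methods target, the basic unscaled proximal-gradient (ISTA) iteration looks as follows; the diagonal memoryless-BFGS scaling and the nonmonotone line search proposed in the paper are not reproduced here.

```python
# Hypothetical sketch: plain proximal gradient (ISTA) for min_x 0.5*||Ax - b||^2 + lam*||x||_1.
import numpy as np

def soft_threshold(v, t):
    # proximal operator of the l1 norm (shrinks each coordinate toward zero by t)
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(A, b, lam, n_iter=500):
    x = np.zeros(A.shape[1])
    step = 1.0 / np.linalg.norm(A, 2) ** 2            # 1/L with L = ||A||_2^2
    for _ in range(n_iter):
        grad = A.T @ (A @ x - b)                      # gradient of the smooth part
        x = soft_threshold(x - step * grad, step * lam)   # proximal step for the l1 term
    return x

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 50))
x_true = np.zeros(50)
x_true[:5] = 3.0
b = A @ x_true + 0.01 * rng.normal(size=100)
print(np.round(ista(A, b, lam=0.1)[:8], 2))           # recovers the sparse leading coefficients
```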
Chapter
One of the main challenges in data mining is choosing the optimal number of clusters without prior information. Notably, existing methods usually follow the philosophy of cluster validation and hence have underlying assumptions on the data distribution, which prevents their application to complex data such as large-scale images and high-dimensional data from the real world. In this regard, we propose an approach named CNMBI. Leveraging the distribution information inherent in the data space, we map the target task to a dynamic comparison process between cluster centers regarding positional behavior, without relying on complete clustering results or designing a complex validity index as before. Bipartite graph theory is then employed to efficiently model this process. Additionally, we find that different samples have different confidence levels and thereby actively remove low-confidence ones, which is, for the first time to our knowledge, considered in cluster number determination. CNMBI is robust and allows for more flexibility in the dimension and shape of the target data (e.g., CIFAR-10 and STL-10). Extensive comparisons with state-of-the-art competitors on various challenging datasets demonstrate the superiority of our method.
Article
Full-text available
Multi Target Regression (MTR) is a machine learning method that simultaneously predicts multiple real-valued outputs using a set of input variables. Many emerging applications can be mapped to this class of problem. In MTR, one of the critical aspects is handling structural information such as instance and target correlations. MTR algorithms attempt to exploit these interdependencies when building a model. This results in increased model complexity, which, in turn, reduces the interpretability of the model through manual analysis of the results. However, data-driven real-world applications often require models that can be used to analyze and improve real-world workflows. Leveraging dimensionality reduction techniques can reduce model complexity while retaining performance and boosting interpretability. This research proposes multiple feature subset alternatives for MTR using a genetic algorithm, and provides a comparison of the different feature subset selection alternatives in conjunction with MTR algorithms. We propose genetic-algorithm-based feature subset selection both over all targets and per individual target, keeping the structural information intact in the selection process. Experiments are performed on real-world benchmark MTR data sets, and the results indicate that a significant improvement in performance can be obtained with comparatively simple MTR models by utilizing optimal and structured feature selection.
Article
Full-text available
Classification (supervised-learning) of multivariate functional data is considered when the elements of the random functional vector of interest are defined on different domains. In this setting, PLS classification and tree PLS-based methods for multivariate functional data are presented. From a computational point of view, we show that the PLS components of the regression with multivariate functional data can be obtained using only the PLS methodology with univariate functional data. This offers an alternative way to present the PLS algorithm for multivariate functional data. Numerical simulation and real data applications highlight the performance of the proposed methods.
Article
Full-text available
Vertical Federated Learning (VFL) has many applications in the field of smart healthcare, with excellent performance. However, current VFL systems usually focus primarily on privacy protection during model training, while the preparation of training data receives little attention. In real-world applications like smart healthcare, the process of training data preparation may involve a participant's intention, which could be private information for that participant. To protect the privacy of the model training intention, we describe the idea of Intention-Hiding Vertical Federated Learning (IHVFL) and illustrate a framework to achieve this privacy-preserving goal. First, we construct two secure screening protocols to enhance privacy protection in feature engineering. Second, we implement sample alignment based on a novel private set intersection protocol. Finally, we use the logistic regression algorithm to demonstrate the process of IHVFL. Experiments show that our model achieves better efficiency (less than 5 min) and accuracy (97%) on the Breast Cancer medical dataset while maintaining the intention-hiding goal.
Article
Pathfinder algorithm (PFA) is a recently introduced meta-heuristic technique that mimics the cooperative behavior of animal groups in search of the best food area. PFA consists of two phases, namely, the path-finder phase and the follower phase. The former explores new search regions through its versatile explorative power, while in the latter stage, followers change their position by tracking the leader and using their perception. However, PFA is prone to falling into local optima, leading to a slow convergence rate while dealing with high-dimensional ill-conditioned problems. Therefore, this article proposes an improved PFA called ASDR-PFA based on an adaptation of the search dimensional ratio (ASDR) to address the issues in PFA. The proposed method incorporates an ASDR concept that uses a search dimensional ratio (SDR) parameter to generate new candidate solutions using the existing global best. Its strength lies in its dynamic updating of the SDR parameter, which further tunes the balance between exploration and exploitation processes. As a result, the convergence rate of PFA is enhanced. The effectiveness of ASDR-PFA is verified using a set of 16 basic benchmark functions and the IEEE-CEC-2011 and IEEE-CEC-2017 problem suites. Wilcoxon's signed-rank test is also conducted to confirm its statistical significance. Additionally, ASDR-PFA is utilized to tackle some optimal feature selection problems to substantiate its applicability. Furthermore, a comparative assessment of empirical outcomes attained through ASDR-PFA and some modern meta-heuristics is carried out to showcase its suitability level.
Conference Paper
Full-text available
This paper presents the development of an artificial neural network (ANN) for the prediction of heart disease, along with a comprehensive data strategy aimed at improving the adoption of artificial intelligence (AI) in healthcare. This research begins by collecting high quality datasets and combining them to create a diverse and representative dataset comprising various clinical, demographic, and lifestyle factors associated with heart disease. Data cleaning, feature engineering, and a novel feature selection technique utilizing a genetic algorithm and other data mining techniques are used to train the ANN. The neural network architecture is carefully designed according to the dimensions of the data, transfer learning methods are used to increase generalizability, and hyperparameters are optimized to achieve high predictive accuracy. To address the challenges related to AI adoption in healthcare, a robust data strategy is devised, focusing on data quality, privacy, security, and regulatory compliance. The strategy incorporates comprehensive data governance frameworks, secure data sharing protocols, and privacy-preserving techniques to facilitate the responsible and ethical utilization of sensitive medical information. Furthermore, strategies for ensuring interoperability and scalability of AI systems within existing healthcare infrastructure are explored.
Preprint
Full-text available
The $k$-means is a popular clustering objective, although it is inherently non-robust and sensitive to outliers. Its popular seeding or initialization called $k$-means++ uses $D^{2}$ sampling and comes with a provable $O(\log k)$ approximation guarantee \cite{AV2007}. However, in the presence of adversarial noise or outliers, $D^{2}$ sampling is more likely to pick centers from distant outliers instead of inlier clusters, and therefore its approximation guarantees \textit{w.r.t.} $k$-means solution on inliers, does not hold. Assuming that the outliers constitute a constant fraction of the given data, we propose a simple variant in the $D^2$ sampling distribution, which makes it robust to the outliers. Our algorithm runs in $O(ndk)$ time, outputs $O(k)$ clusters, discards marginally more points than the optimal number of outliers, and comes with a provable $O(1)$ approximation guarantee. Our algorithm can also be modified to output exactly $k$ clusters instead of $O(k)$ clusters, while keeping its running time linear in $n$ and $d$. This is an improvement over previous results for robust $k$-means based on LP relaxation and rounding \cite{Charikar}, \cite{KrishnaswamyLS18} and \textit{robust $k$-means++} \cite{DeshpandeKP20}. Our empirical results show the advantage of our algorithm over $k$-means++~\cite{AV2007}, uniform random seeding, greedy sampling for $k$ means~\cite{tkmeanspp}, and robust $k$-means++~\cite{DeshpandeKP20}, on standard real-world and synthetic data sets used in previous work. Our proposal is easily amenable to scalable, faster, parallel implementations of $k$-means++ \cite{Bahmani,BachemL017} and is of independent interest for coreset constructions in the presence of outliers \cite{feldman2007ptas,langberg2010universal,feldman2011unified}.
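The D^2 sampling that the abstract modifies is simply the standard k-means++ seeding; a plain version is sketched below, without the preprint's outlier-robust change to the sampling distribution.

```python
# Hypothetical sketch: standard k-means++ (D^2) seeding.
import numpy as np

def dsquared_seeding(X, k, seed=0):
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]               # first center uniformly at random
    for _ in range(k - 1):
        # squared distance of every point to its nearest already-chosen center
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        probs = d2 / d2.sum()                         # sample proportionally to squared distance
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)

# toy data: three well-separated Gaussian blobs
X = np.vstack([np.random.default_rng(i).normal(loc=5 * i, size=(100, 2)) for i in range(3)])
print(dsquared_seeding(X, k=3))
```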
Article
Full-text available
Smart cities have seen a growing interest among governments, researchers, and industries. Smart cities use digital technologies to enhance the quality of life for residents while promoting sustainability and efficient resource management. By integrating various technologies such as Information and Communication Technologies (ICT), Artificial Intelligence (AI), and Internet of Things (IoT), smart cities can improve the delivery of public services, optimize transportation systems, reduce energy consumption, and enhance public safety, among other benefits. Smart cities focus on automating different disciplines, including smart environment, smart home, smart economy, smart mobility, and smart governance. As a result, multi-agent driven smart cities have received tremendous attention from the research community for obtaining intelligent solutions to complex problems in different disciplines by subdividing responsibilities into multiple agents and empowering agents through AI. In this regard, it is vital to explore the usage of multi-agent systems in different critical application areas of smart cities. In this paper, a detailed description of the multi-agent process for smart city application areas is provided, along with resources and future research directions. Four different application areas: smart home, smart governance, smart environment, and smart mobility are discussed in detail.
Chapter
Document classification is a prevalent task in natural language processing with broad applications in the biomedical domain, including biomedical literature indexing, automatic diagnosis code assignment, tweet classification for public health topics, patient safety report classification, etc. In recent years, the categorization of biomedical literature has played a vital role in biomedical engineering. Nevertheless, manually classifying the biomedical papers published every year into predefined categories is a cumbersome task. Hence, building an effective automatic document classifier for biomedical databases emerges as a significant task for the scientific community. This chapter therefore investigates the deployment of state-of-the-art machine learning (ML) algorithms like decision tree, k-nearest neighborhood, Rocchio, ridge, passive–aggressive, multinomial naïve Bayes (NB), Bernoulli NB, support vector machine, and artificial neural network classifiers such as perceptron, random gradient descent, and BPN in the automatic classification of biomedical text documents on benchmark datasets like BioCreative Corpus III (BC3), Farm Ads, and the TREC 2006 Genomics Track. Finally, the performance of all these classifiers is compared and evaluated by means of well-defined metrics like accuracy, error rate, precision, recall, and f-measure.
Article
Support vector quantile regression (SVQR) adapts the flexible pinball loss function for the empirical risk in regression problems. Furthermore, \(\varepsilon -\)SVQR obtains sparsity by introducing the \(\varepsilon -\)insensitive approach to SVQR. Despite their excellent generalisation performance, the loss functions employed by SVQR and \(\varepsilon -\)SVQR remain sensitive to noise and outliers. This paper suggests a new robust SVQR model called robust support vector quantile regression with truncated pinball loss (RSVQR). RSVQR employs a truncated pinball loss function to reduce the impact of noise. The employed loss function takes a non-convex structure, which might lead to a local optimum solution. Further, to solve the non-convex optimization problem formulated using the employed non-convex loss function, we apply the concave–convex procedure (CCCP) to the cost function of the proposed method, which decomposes the total loss into one convex and one concave part. A few interesting artificial and real-world datasets are considered for the experimental analysis. Support vector regression (SVR), Huber loss-based SVR (HSVR), asymmetric HSVR (AHSVR), SVQR and \(\varepsilon -\)SVQR are used to compare the results of the suggested model. The obtained results reveal the applicability of the proposed RSVQR model.
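For reference, the standard pinball (quantile) loss that SVQR minimizes, together with a naively truncated variant that caps the loss to limit the influence of outliers; the cap value and the exact form of the truncation are illustrative assumptions, not the paper's definition.

```python
# Hypothetical sketch: pinball (quantile) loss and a simple truncated variant.
import numpy as np

def pinball_loss(residual, tau):
    # tau in (0, 1) is the target quantile; residual = y - prediction
    return np.where(residual >= 0, tau * residual, (tau - 1) * residual)

def truncated_pinball_loss(residual, tau, cap=1.0):
    # capping the loss limits the influence of large (outlying) residuals
    return np.minimum(pinball_loss(residual, tau), cap)

r = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(pinball_loss(r, tau=0.9))
print(truncated_pinball_loss(r, tau=0.9, cap=1.0))
```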