Comparative Analysis of Features Selection Techniques
for Classification in Healthcare
Shruti Kaushik1,a, Abhinav Choudhury1,b, Ashutosh Kumar Jatav2,c, Nataraj
Dasgupta3,d, Sayee Natarajan3,e, Larry A. Pickett3,f, and Varun Dutt1,g
1Applied Cognitive Science Laboratory, Indian Institute of Technology Mandi, Himachal
Pradesh, India – 175005
2Indian Institute of Technology Jodhpur, Rajasthan, India - 342037
3RxDataScience, Inc., USA - 27709
Abstract. Analyzing high-dimensional data is a major challenge in the field of
data mining. Features selection is an effective way to remove irrelevant
information from the data. Prior research has utilized the Apriori frequent-set
mining approach to discover the relevant and interrelated features in the health
domain. However, the comparison of the Apriori algorithm with other features
selection approaches is absent in the literature. This paper aims to compare the
state-of-the-art features selection techniques with Apriori in the presence of
thousands of features in a healthcare dataset. After the features are selected we
perform a three-class classification using a number of machine-learning
algorithms where patients are classified according to the pain medication they
consume. Results revealed that among LASSO, ridge regression, PCA,
information gain, Apriori, and correlation-based features selection techniques,
LASSO followed by classification gave the highest accuracy. We highlight the
implications of using feature-selection algorithms before classification in
healthcare datasets.
Keywords: Features selection, Dimensionality reduction, Machine Learning, Healthcare,
High dimensional dataset.
1 Introduction
In the past few years, there have been significant developments in how machine
learning (ML) can be used in various industries and research [1]. In recent years, ML
algorithms have also been utilized in the healthcare sector [2]. The existence of
electronic health records (EHRs) has allowed researchers to apply ML algorithms for
public health management, diagnoses, disease treatments, and for analyzing patients’
medical history [3, 12]. Mining hidden patterns in healthcare data sets could help
healthcare providers and pharmaceutical companies timely plan quality healthcare for
patients in need [12].
In general, healthcare datasets are large, and they may contain several thousands of
attributes [15]. To predict healthcare outcomes accurately, ML algorithms need to
focus on selecting relevant features from data [4]. However, the presence of
thousands of attributes in data may become problematic for classification algorithms
for processing these attributes as several features may involve large memory usage
and high computational costs [13]. Generally, two feature-engineering techniques
have been suggested in the literature to address the problem of datasets possessing a
large number of features: feature reduction (dimensionality reduction) and feature
selection [14]. Feature reduction techniques reduce the number of attributes by
creating new combinations of attributes, whereas feature selection techniques include
and exclude attributes present in data without changing them [14]. In the past,
algorithms like principal component analysis (PCA) have been used for feature
reduction [5]. Also, filter methods (e.g., information gain, correlation coefficients
score), wrapper methods (e.g., recursive features elimination), and embedded methods
(e.g., LASSO, ridge regression) have been used for features selection in the past [6].
Another way for feature selection is by using frequent item-set mining algorithms
(e.g., Apriori algorithm) [15]. Apriori selects a subset of features by looking at the
associations among items while selecting frequent item-sets [7].
Kaushik et al. [15] compared the classification accuracies of various algorithms
when all features were present and when only the features selected by the Apriori
algorithm were present in classification models. Classification accuracies were higher
with the Apriori-selected features than with all features present [15].
Although prior research has compared a number of features selection
techniques [6], a comparison of these feature selection techniques with the
Apriori algorithm has been less explored. In this paper, we address this gap in the
literature and compare the performance of the Apriori algorithm for feature selection
with various other feature selection algorithms like PCA, information gain,
correlation coefficients score, LASSO, and ridge regression.
Specifically, we take a healthcare dataset involving consumption of two pain
medications in the US, and we apply different feature selection techniques including
the Apriori algorithm for feature selection and subsequent classification. Most of the
attributes in our dataset are the unique diagnoses and procedures associated with
patients. These diagnoses and procedures make sparse data values as there may be
variability present among patients in terms of the diagnoses and procedures associated
with each patient. Also, diagnoses and procedures might be inter-related in the dataset
which may help certain feature-selection algorithms compared to the Apriori
algorithm. Furthermore, to get confidence in our results, we try to solve a 3-class
classification task by applying several ML algorithms such as the decision tree [22],
Naïve Bayes classifier [23], logistic regression [24], and support vector machine
(SVM) [25] on the selected features.
In what follows, we first provide a brief review of related literature. Next, we
explain the methodology of applying various features selection techniques. In Section
4, we present our experimental results and compare classification accuracies after
selecting features from different features selection techniques. Finally, we discuss our
results and conclude our paper by highlighting the main implications of this research
and its future scope.
2 Background
Researchers have applied data mining algorithms to mine the hidden knowledge in the
healthcare domain [2, 8, 12]. For example, various data-mining approaches such as
clustering, classification, and association-rule mining have made a significant
contribution to medical research [8, 12].
Prior research has utilized information gain and certain hybrid approaches for
features selection and developed classification systems for predicting chronic diseases
[17]. Harb & Desuky have used a correlation-based features selection approach to
improve the classification of medical datasets [18]. Kamkar et al. have used the
LASSO approach for features selection and compared it with information gain to
build clinical prediction models [27]. Also, Fonti and Belitser have compared LASSO
and ridge regression to perform features selection on high dimensional datasets [21].
Moreover, researchers have also evaluated PCA for features selection in image
recognition tasks [5]. These researchers have used various ML techniques like J48,
Naïve Bayes, SVM, Bayesian Networks, neural networks, and K-nearest neighbor
approach for performing classification on the selected features set [15-18].
Furthermore, researchers have used frequent-set mining algorithms like Apriori to
find the associations between diagnosis and treatment [9]. Besides, certain researchers
have focused on identifying frequent diseases using Apriori [10], and others have
discovered frequently appearing diagnoses and procedures for consumers of certain
pain medications [15]. Some researchers have used Apriori to reveal the unexpected
patterns in diabetic data repositories [11] and to predict the risks of heart diseases
However, to the best of authors’ knowledge, prior research has yet to compare
Apriori algorithms (for selecting features) with the other well-known features
selection techniques to predict the healthcare outcomes. In this paper, we attend to
this literature gap and perform features selection using Apriori frequent-set mining
algorithm, information gain, correlation coefficient score, LASSO, ridge regression,
and PCA for classifying patients based on pain medications they consumed. To
compare these approaches, we take a specific healthcare dataset where there are
potentially thousands of features to choose between, and then we investigate the
classification accuracies of different ML algorithms using the selected features set.
2.1 Feature Selection Methods
Feature selection methods have mainly three categories: wrapper, filter, and
embedded [17]. The wrapper methods assess subsets of variables according to their
usefulness to a given predictor. Filter methods use the general characteristics of data
itself and work separately from the learning algorithm. The filter methods use the
statistical correlation between a set of features and the target feature. The amount of
correlation between a feature and the target variable determines the importance of that
feature [17]. Filter-based approaches are not dependent on classifiers and are usually
faster and more scalable. Also, they have low computational complexity. Examples of
filter methods include information gain, correlation coefficient, and chi-square test.
Moreover, some learning algorithms perform features selection as part of their
overall operation. These include regularization techniques such as LASSO (L1
regularization) and ridge regression (L2 regularization). These two techniques are part
of the embedded approach for features selection.
Apart from these methods, PCA as a feature reduction technique could be used to
find relevant features in data [5]. PCA tries to find the direction of most variations in
data and then transforms the original attributes space to the new space of maximum
In this paper, we used the following features selection techniques: two filter
methods (information gain and correlation coefficients score), two embedded methods
(LASSO and ridge regression), one features reduction technique (PCA), and one
frequent item-set mining approach (Apriori) for finding the relevant features. We
investigated the classification accuracy of different ML algorithms using the features
selected from the approaches mentioned above.
2.1.1 Information Gain
Information gain (IG) can detect the features possessing the maximum information
based on a specific class [19]. It is one of the most important feature ranking methods,
which measures dependency between a feature and a class label. The IG of feature $X$
and class $Y$ is calculated as:
$$IG(X, Y) = H(X) - H(X \mid Y) \tag{1}$$
where $H(X)$ is the entropy, a measure of the uncertainty of a random
variable. If we assume that we have a two-class classification problem (y=0 and y=1
are the class labels), then $H(X)$ and $H(X \mid Y)$
are defined as:
$$H(X) = -\sum_{i} P(x_i) \log_2 P(x_i) \tag{2}$$
$$H(X \mid Y) = -\sum_{y \in \{0, 1\}} P(y) \sum_{i} P(x_i \mid y) \log_2 P(x_i \mid y) \tag{3}$$
In this technique, for each feature, we calculate information gain (IG)
independently, and top k features are selected as the final feature set.
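As an illustration of this ranking step, the following minimal Python sketch computes an entropy-based information gain for discrete features and keeps the top k. It conditions the class on the feature, which gives the same value as conditioning the feature on the class because the quantity is symmetric. The function names and the toy columns are hypothetical and are not taken from the paper's implementation.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    total = len(values)
    return -sum((c / total) * log2(c / total) for c in Counter(values).values())


def information_gain(feature, labels):
    """Reduction in label entropy after conditioning on a discrete feature."""
    total = len(labels)
    conditional = 0.0
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        conditional += (len(subset) / total) * entropy(subset)
    return entropy(labels) - conditional


def top_k_by_information_gain(columns, labels, k):
    """Score each named column independently and keep the k best names."""
    scores = {name: information_gain(col, labels) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical toy data: one feature separates the classes, one does not.
labels = ["A", "A", "B", "B"]
columns = {"perfect": [1, 1, 0, 0], "useless": [1, 0, 1, 0]}
best = top_k_by_information_gain(columns, labels, k=1)
```

On this toy data, the feature that separates the two classes perfectly scores 1 bit, and the feature unrelated to the class scores 0.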
2.1.2 Correlation-based Feature Selection (CFS)
CFS is a simple filter algorithm that ranks feature subsets and discovers the merit of a
feature or a subset of features according to a correlation based heuristic evaluation
function [20]. The purpose of CFS is to find subsets that contain features that are
highly correlated with the class and uncorrelated with each other. The redundant
features are excluded, as they will be highly correlated with one or more of the
remaining features. The acceptance of a feature will depend on the extent to which it
predicts classes in areas of the instance space not already predicted by other features
[20]. CFS’s feature-subset evaluation function is shown as follows:
$$Merit_S = \frac{k\,\overline{r_{cf}}}{\sqrt{k + k(k-1)\,\overline{r_{ff}}}} \tag{4}$$
where $Merit_S$ is the heuristic “merit” of a feature subset S containing k features,
$\overline{r_{cf}}$ is the mean feature-class correlation (f ϵ S), and
$\overline{r_{ff}}$ is the average feature-feature
inter-correlation. This equation is, in fact, Pearson’s correlation, where all
variables have been standardized. The numerator can be thought of as indicating how
predictive of the class a group of features is; the denominator is an indication of how
much redundancy there is among them (features).
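The merit heuristic can be sketched in a few lines of Python. This is a minimal illustration that assumes numeric features, a numerically encoded class, and the absolute Pearson correlation as the correlation measure; the helper names are hypothetical and this is not the implementation used in the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no correlation
    return cov / sqrt(var_x * var_y)


def cfs_merit(features, target):
    """Merit of a feature subset: k * r_cf / sqrt(k + k * (k - 1) * r_ff)."""
    k = len(features)
    r_cf = sum(abs(pearson(f, target)) for f in features) / k
    if k == 1:
        return r_cf
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
    r_ff = sum(abs(pearson(features[i], features[j])) for i, j in pairs) / len(pairs)
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)


# Hypothetical toy columns.
target = [0, 0, 1, 1]
informative = [0, 0, 1, 1]
duplicate = [0, 0, 1, 1]    # redundant copy of the informative feature
unrelated = [0, 1, 0, 1]    # uncorrelated with the class
```

In this toy case, adding the redundant copy leaves the merit unchanged, while adding the unrelated feature lowers it, which is the behaviour the numerator and denominator are meant to produce.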
2.1.3 Least Absolute Shrinkage and Selection Operator (LASSO)
LASSO is a powerful method that performs mainly two tasks: regularization and
feature selection [21]. The LASSO method puts a constraint on the sum of the
absolute values of the model parameters; the sum has to be less than a fixed value
(upper bound). The method applies a shrinking (L1 regularization) process where it
penalizes the coefficients of the regression variables shrinking some of them to zero.
During the features selection process, the variables that still have a non-zero coefficient
after the shrinking process are selected to be part of the model. The goal of this
process is to minimize the prediction error. In practice, the tuning parameter λ, which
controls the strength of the penalty, assumes great importance. When λ is
sufficiently large, coefficients are forced to be exactly zero, and in this way
dimensionality is reduced. The larger the parameter λ, the more
coefficients are shrunk to zero. On the other hand, if λ = 0, we have an OLS (ordinary
least squares) regression.
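The shrink-to-zero behaviour can be illustrated with a small coordinate-descent solver. The sketch below is a minimal pure-Python version that assumes centered data with no intercept and the objective (1/2n)·||y − Xβ||² + λ·Σ|β_j|; this scaling, the function names, and the toy data are assumptions for illustration and are not taken from the paper.

```python
def soft_threshold(value, lam):
    """Shrink value toward zero by lam; values within [-lam, lam] become 0."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Cyclic coordinate descent for the L1-penalized least-squares objective."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X @ beta while beta is all zeros
    for _ in range(n_iter):
        for j in range(p):
            col = [row[j] for row in X]
            z = sum(v * v for v in col) / n
            if z == 0.0:
                continue  # skip all-zero columns
            # Correlation of feature j with the residual that excludes feature j.
            rho = sum(v * (r + v * beta[j]) for v, r in zip(col, residual)) / n
            new_beta = soft_threshold(rho, lam) / z
            delta = new_beta - beta[j]
            if delta != 0.0:
                residual = [r - v * delta for v, r in zip(col, residual)]
                beta[j] = new_beta
    return beta


def selected_features(beta):
    """Indices of the features whose coefficient survived the shrinkage."""
    return [j for j, b in enumerate(beta) if b != 0.0]


# Hypothetical toy data: y depends only on the first column.
X = [[1, 1], [-1, 1], [1, -1], [-1, -1]]
y = [2, -2, 2, -2]
```

With these toy values, λ = 0 recovers the least-squares coefficient of 2 for the first column, λ = 0.5 shrinks it to 1.5 while the second column stays at zero, and λ = 3 removes every feature.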
2.1.4 Ridge Regression
Ridge regression works by penalizing the magnitude of coefficients of features along
with minimizing the error between predicted and actual observations [21]. This is a
regularization technique like LASSO. It performs L2 regularization where it adds the
penalty equivalent to the square of the magnitudes of coefficients.
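For comparison, the two penalized objectives can be written side by side. The 1/(2n) scaling of the squared-error term is one common convention and is an assumption here; the paper does not state which scaling it used.

```latex
\begin{align*}
\hat{\beta}^{\text{lasso}} &= \arg\min_{\beta} \; \frac{1}{2n} \lVert y - X\beta \rVert_2^2 + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert \\
\hat{\beta}^{\text{ridge}} &= \arg\min_{\beta} \; \frac{1}{2n} \lVert y - X\beta \rVert_2^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \\
% Setting the gradient of the ridge objective to zero gives a closed form:
\hat{\beta}^{\text{ridge}} &= \left( X^{\top} X + 2 n \lambda I \right)^{-1} X^{\top} y
\end{align*}
```

Under this convention the ridge solution is a smooth rescaling of the least-squares fit, so its coefficients shrink as λ grows but are generally not exactly zero, whereas the absolute-value penalty in LASSO can set coefficients exactly to zero.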
2.1.5 Principal Component Analysis (PCA)
PCA does not directly select the features; it is a dimension reduction technique. PCA
aims to reduce the dimensionality of a dataset that contains a large number of
correlated attributes by transforming the original attributes space to a new space in
which attributes are uncorrelated [5]. The algorithm then ranks the transformed
attributes by the variation they capture. The transformed attributes with the most variation
are kept, and the rest of the attributes are discarded. It is also important to mention that
PCA is an unsupervised technique because it does not take the class labels into account.
2.1.6 Apriori Algorithm
The Apriori algorithm [7] is used for finding the frequent item-sets in a transaction
database. It uses an iterative level-wise approach to generate the frequent item-sets.
This algorithm works in the following steps:
1. The transactions in database D are scanned to determine the frequent 1-itemsets, $L_1$,
that possess the minimum support, where the support of an itemset X is defined
as the proportion of the transactions in the database that contain the itemset X.
2. Generate the candidate k-itemsets, $C_k$, by joining pairs of frequent (k-1)-itemsets from $L_{k-1}$, and
remove any candidate that has an infrequent subset.
3. Scan D to get the support count for each candidate k-itemset in $C_k$.
4. The set of frequent k-itemsets, $L_k$, is then determined from the support
counts of the candidates in $C_k$.
5. Return to step 2 until there are no candidate (k+1)-itemsets, $C_{k+1}$.
6. Extract the frequent itemsets, $L = \bigcup_k L_k$.
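The level-wise steps above can be sketched in Python as follows. This is a minimal, unoptimized illustration: the function name, the in-memory list of transactions, and the single-letter item codes are hypothetical, and support is recomputed by rescanning all transactions for every candidate.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return a dict mapping each frequent itemset to its support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Proportion of transactions that contain every item of the itemset.
        return sum(1 for t in transactions if itemset <= t) / n

    # Step 1: frequent 1-itemsets.
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}

    frequent = {}
    k = 1
    while current:
        for itemset in current:
            frequent[itemset] = support(itemset)
        k += 1
        # Step 2: join frequent (k-1)-itemsets into candidate k-itemsets ...
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # ... and prune candidates that contain an infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        # Steps 3 and 4: count support and keep the frequent candidates.
        current = {c for c in candidates if support(c) >= min_support}
    return frequent


# Hypothetical toy transactions: each set holds the codes recorded for one patient.
toy = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b"}]
result = apriori(toy, min_support=0.5)
```

On the toy transactions with a 50% minimum support, the three single items and the pairs {a, b} and {a, c} are frequent; the triple {a, b, c} is pruned because its subset {b, c} is not frequent.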
After selecting the relevant features from these features selection techniques, we
formed a classification problem. We applied the following ML algorithms to classify
our data: decision tree, Naïve Bayes, logistic regression, and support vector machine.
2.2 Machine-Learning Algorithms
2.2.1 Decision Tree
Decision tree is a classification algorithm that classifies class instances by sorting
them down the tree from root to the leaf node. Each node in the decision tree specifies
a test on an attribute of the instance, and each branch descending from the node
corresponds to one possible value of the attribute. The following assumptions are made
while creating a decision tree [22]:
1. Initially, the complete set of training attributes is evaluated at the root node.
2. Categorical feature values are preferred to continuous ones. Continuous values
need to be discretized before building the model.
3. Attribute values are used to recursively distribute the records.
4. Entropy and gain are calculated for each attribute to decide its placement
within the decision tree.
The main challenge in building a decision tree is choosing which attribute to test at each
node in the tree. Random selection of attributes for nodes leads to very low accuracy
[22]. We have used the information-gain measure to identify the attribute to
split on at each level.
Information Gain: Information gain is based on the concept of entropy from
information theory. We assume attributes to be categorical while using information
gain as an attribute selection criterion. Entropy is defined as [22]:
$$Entropy = -\sum_{i} p_i \log_2 p_i \tag{5}$$
where $p_1, p_2, \ldots$ are fractions that add up to 1 and represent the percentage of
each class present in the child node that results from a split in the tree [22].
Furthermore, Information Gain is defined as:
Information Gain = Entropy (parent) - Weighted Sum of Entropy (children)
$$IG(T, a) = H(T) - H(T \mid a) \tag{6}$$
where T is the set of instances at the parent node and a is an attribute in the data. Information gain (IG) calculates the expected
reduction in entropy due to sorting on the attribute. At any node, attributes with the
maximum value of information gain are preferred over other attributes.
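As a worked example with hypothetical counts, consider a parent node holding 10 instances, 5 from each of two classes, and an attribute that splits them into two children of 5 instances each, with class counts (4, 1) and (1, 4):

```latex
\begin{align*}
\text{Entropy(parent)} &= -\tfrac{5}{10}\log_2\tfrac{5}{10} - \tfrac{5}{10}\log_2\tfrac{5}{10} = 1 \\
\text{Entropy(child)} &= -\tfrac{4}{5}\log_2\tfrac{4}{5} - \tfrac{1}{5}\log_2\tfrac{1}{5} \approx 0.722 \\
\text{Information Gain} &= 1 - \left(\tfrac{5}{10}\cdot 0.722 + \tfrac{5}{10}\cdot 0.722\right) \approx 0.278
\end{align*}
```

Both children have the same entropy here because their class proportions mirror each other. An attribute that produced two pure children would reach the maximum gain of 1 for this parent.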
2.2.2 Naïve Bayes
Naïve Bayes is probabilistic classifier that is based on the Bayes theorem. It is called
naïve because it assumes a strong independence assumption between features [23]. It
assumes that the value of a particular feature is independent of the value of any other
feature, given the class variable. Despite this assumption, Naïve Bayes has been quite
successful in solving practical problems in text classification, medical diagnosis and
system performance management [23]. The classifier attempts to maximize the
posterior probability in determining the class of a transaction.
Suppose a vector $y = (x_1, x_2, \ldots, x_n)$ represents the features in the problem, with n
denoting the total number of features, and let K be the number of possible classes $C_k$.
Naïve Bayes is a conditional probability model which can be decomposed as [23]:
$$p(C_k \mid y) = \frac{p(C_k)\, p(y \mid C_k)}{p(y)} \tag{7}$$
Under the independence assumption, the probabilities of the attributes are defined as
follows [23]:
$$p(y \mid C_k) = \prod_{i=1}^{n} p(x_i \mid C_k) \tag{8}$$
The most probable class is then picked based on the maximum a posteriori (MAP)
decision rule [23] as follows:
$$\hat{C} = \arg\max_{k \in \{1, \ldots, K\}} p(C_k) \prod_{i=1}^{n} p(x_i \mid C_k) \tag{9}$$
2.2.3 Logistic Regression
Logistic regression is a linear classifier which is used to model the relationship
between one dependent binary variable and one or more independent variables [24]. It
models the posterior probabilities of the k number of classes in an instance. The
simple logistic regression is defined as:
$$y = \frac{e^{a_0 + a_1 x}}{1 + e^{a_0 + a_1 x}} \tag{10}$$
where y is the predicted output, $a_0$ is the bias or intercept term, and $a_1$ is the
coefficient for the input value (x).
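The simple (single-input, two-class) logistic model can be evaluated directly. The sketch below uses a numerically stable form of the logistic function; the function names, the 0.5 decision threshold, and the coefficient values in the usage lines are hypothetical.

```python
from math import exp


def logistic_probability(x, a0, a1):
    """Predicted probability e^(a0 + a1*x) / (1 + e^(a0 + a1*x))."""
    z = a0 + a1 * x
    if z >= 0:
        # Algebraically identical form that avoids overflow for large z.
        return 1.0 / (1.0 + exp(-z))
    ez = exp(z)
    return ez / (1.0 + ez)


def classify(x, a0, a1, threshold=0.5):
    """Map the predicted probability to a binary label."""
    return 1 if logistic_probability(x, a0, a1) >= threshold else 0


# Hypothetical coefficients: the decision boundary sits at x = 2.
p_at_boundary = logistic_probability(2.0, a0=-2.0, a1=1.0)
label_above = classify(3.0, a0=-2.0, a1=1.0)
label_below = classify(1.0, a0=-2.0, a1=1.0)
```

With a0 = -2 and a1 = 1, the predicted probability is exactly 0.5 at x = 2, so inputs above 2 are assigned label 1 and inputs below 2 are assigned label 0.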
2.2.4 Support Vector Machines
Support vector machines (SVMs) are supervised classification techniques which are
accurate and robust even for small training samples [25]. Furthermore, they can
handle large feature spaces. SVMs are binary classifiers which can
be used for multi-class classification tasks as well. They build a hyperplane or a set of
hyperplanes in a high-dimensional space which can be used for classification and
regression-based tasks. SVMs can classify linearly as well as non-linearly separable
data [25]. If the data are linearly separable, then the SVM uses a linear hyperplane to
perform classification. However, for non-linear data, rather than fitting a non-linear
curve, it transforms the data into a high-dimensional space to perform classification.
The SVM uses kernel functions, e.g., the radial basis function (RBF) kernel, to transform
the data into the high-dimensional space for classifying non-linear data [25]. For
better classifications, we optimize the support weights to minimize the objective
(error) function.
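For reference, the RBF kernel mentioned above is commonly written as K(x, z) = exp(-γ·||x − z||²). The sketch below evaluates it for two feature vectors; the parameter name gamma and its value are assumptions, since the paper does not report its kernel settings.

```python
from math import exp


def rbf_kernel(x, z, gamma):
    """Radial basis function kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return exp(-gamma * squared_distance)


def kernel_matrix(points, gamma):
    """Pairwise kernel values for a list of feature vectors."""
    return [[rbf_kernel(p, q, gamma) for q in points] for p in points]


# Hypothetical points and gamma.
K = kernel_matrix([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0]], gamma=0.5)
```

The kernel equals 1 for identical points and decays toward 0 as points move apart, so it acts as a similarity measure; a larger gamma makes the similarity fall off more quickly with distance.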
3 Method
3.1 Data
In this paper, we used the Truven MarketScan® health dataset containing patients’
insurance claims in the US [15]. The dataset contains approximately 45,000 patients
who consumed medicine A, medicine B, or both of these two pain medications between
January 2011 and December 2015.¹ The dataset contains patients’ demographic
variables (age, gender, region, and birth year), clinical variables (admission type,
diagnoses made, and procedures performed), the name of medicines, and medicines’
refill counts per patient. The dataset contains 55.20% records of patients who
consumed medicine A only, 39.98% records of medicine B only, and 4.82% records
for those patients who consumed both these medications. There were 15,081 attributes
present in total against each patient in this dataset, out of which 15,075 attributes were
diagnoses and procedure codes some of which were inter-related. We applied the
features selection algorithms (Information gain, correlation coefficients score,
LASSO, ridge regression, PCA, and Apriori) on 15,075 diagnoses and procedure
codes to select the relevant features and then combined the selected features
with the other 6 independent features. Table 1 shows the list of 6 features that were
used along with the selected features from different features selection techniques in
different ML algorithms. The ML algorithms classified patients according to the
medications consumed by them, i.e., medicine A, medicine B, or both.
Table 1. Description of Input Features for Classification Problem
Features | Description
Age group | 7, 18
Region | Northeast, northcentral, south, west
Type of admission | Surgical, medical, maternity and newborn, psych and substance abuse, unknown
Refill count | Count in number
Pain medication | A, B, Both
¹ Due to a non-disclosure agreement, we have anonymized the actual names of these medications.
3.2 Model Calibration
3.2.1 Features Selection
First, we performed the features selection using the Apriori algorithm. Apriori
algorithm gives the frequently appearing items in the dataset. With 3% support, we
found 9 frequently appearing diagnoses and procedures out of 15,075 diagnoses and
procedures using the Apriori algorithm. The 3% support was chosen after a sensitivity
analysis in which the male-female ratio of frequently appearing diagnoses and
procedures was checked [15]. In order to compare the other features selection
techniques with the Apriori method, we selected the top 9 features from information
gain, CFS, LASSO, ridge regression, and PCA as well.
3.2.2 Information Gain
We calculated information gain for each feature for the output variable. Information
gain values vary from 0 (no information) to 1 (maximum information). Those features
that contribute more information will have a higher information gain value and can be
selected, whereas those that do not add much information will have a lower score and
can be removed. Furthermore, using the ranker search method [28], we obtained a
ranked list of the top 9 attributes. The search method is the technique by which we
navigate different combinations of attributes in the dataset in order to arrive at a short
list of chosen features.
3.2.3 CFS
CFS calculates the Pearson correlation between each feature and the output variable
(class), selects only those features that have a moderate-to-high positive or
negative correlation (close to -1 or 1), and drops those features with a low correlation (a
value close to zero). Similar to the information gain method, we used the ranker
search [28] approach to obtain a list of the top 9 attributes.
3.2.4 LASSO
LASSO is a regularization and features selection method. As described in the section
above, the parameter λ controls the strength of the penalty [21]. The larger the value
of λ, the greater the shrinkage. We adjusted the value of λ so that we obtained
exactly the 9 most relevant attributes out of 15,075 diagnoses and procedures. The
resulting value of λ was 0.0027.
3.2.5 Ridge Regression
Ridge regression is similar to LASSO. However, ridge regression shrinks the
coefficients towards zero without setting them exactly to zero [21]. Therefore, we
ranked the attributes (based on the magnitude of their coefficients) given by ridge
regression for λ = 0.0027 and selected the top 9 attributes from them.
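The ranking step can be sketched as follows. The function name, the coefficient values, and the feature codes are hypothetical and serve only to show ranking by absolute coefficient size.

```python
def top_k_by_magnitude(coefficients, k):
    """Return the k feature names with the largest absolute coefficients."""
    ranked = sorted(coefficients, key=lambda name: abs(coefficients[name]), reverse=True)
    return ranked[:k]


# Hypothetical ridge coefficients keyed by (made-up) diagnosis and procedure codes.
ridge_coefficients = {"dx_001": -0.80, "dx_002": 0.05, "px_001": 0.30, "dx_003": -0.01}
top_two = top_k_by_magnitude(ridge_coefficients, k=2)
```

Ranking uses the absolute value, so a strongly negative coefficient counts as much as a strongly positive one. Such a ranking is only meaningful when the features are on comparable scales, because ridge coefficients depend on the scale of each feature.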
3.2.6 PCA
As explained above, PCA reduces dimensions by transforming the original features set; it
does not select features as the other techniques discussed in this paper do. As its
name says, PCA finds the principal components in the data. Principal components are
the directions along which the data are most spread out, i.e., the directions with the most
variance. Implementing PCA amounts to finding the eigenvalues and eigenvectors of the
data’s correlation matrix [5]. Eigenvectors and eigenvalues exist in pairs: every
eigenvector has a corresponding eigenvalue. The eigenvector gives the direction, and
the corresponding eigenvalue (a number) tells how much variance there is in
the data in that direction. In this paper, we selected the 9 principal components (9
directions or eigenvectors) with the 9 highest eigenvalues. This means that we
projected our data onto these 9 principal components. Together, these 9
directions explained 15.85% of the variance of the whole data.
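The share of variance covered by the leading components follows directly from the eigenvalues: it is the sum of the k largest eigenvalues divided by the sum of all eigenvalues. The sketch below illustrates this calculation; the function names and the eigenvalues are hypothetical and are not the values from the paper's dataset.

```python
def explained_variance_ratio(eigenvalues, k):
    """Fraction of total variance captured by the k largest eigenvalues."""
    ordered = sorted(eigenvalues, reverse=True)
    return sum(ordered[:k]) / sum(ordered)


def components_needed(eigenvalues, target):
    """Smallest number of components whose cumulative ratio reaches target."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for count, value in enumerate(ordered, start=1):
        running += value
        if running / total >= target:
            return count
    return len(ordered)


# Hypothetical eigenvalues of a small correlation matrix.
toy_eigenvalues = [4.0, 3.0, 2.0, 1.0]
ratio = explained_variance_ratio(toy_eigenvalues, k=2)
```

For the toy eigenvalues, the two leading components capture 7 of the 10 units of variance, and three components are needed to pass 85%.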
Table 2 shows the list of features selected from different features selection techniques.
Table 2. Description of Selected Features from Different Feature Selection Techniques
Apriori | Information Gain and CFS* | LASSO | Ridge Regression
Total knee arthroplasty | Blisters, epidermal loss | Chronic pain | Oxygen supplies
Osteoarthrosis secondary lower leg | Third-degree perineal | Opioid type | Closed fracture of base of skull without mention of intracranial injury, unspecified state of consciousness
Removal of foreign body from eye | Traumatic amputation of arm and hand | Other chronic pain | Artery bypass
Total knee replacement | Under cardiac | Opioid type | Facial nerve injury due to birth trauma
Osteoarthrosis primary lower leg | Arthropathy associated with other endocrine and metabolic disorders | Backache | Allergic rhinitis due to food
generalized lower leg | Closed dislocation | Diagnostic Procedures of Spine and Pelvis | evaluation of fine needle aspirate
Total hip arthroplasty | Malignant neoplasm of | Treat thigh
Fasciolopsiasis | Basal cell carcinoma of skin of other and unspecified parts of the face | Degeneration of lumbar or intervertebral disc | Drowning and submersion due to other accident to
Total hip replacement | Unspecified malignant neoplasm of skin | Tobacco use | Psychotherapy for 60 minutes
3.2.7 Machine Learning for Classification
For the ML analyses, the dataset was divided into two parts: 70% of the data was used
for training, and 30% of the data was used for testing. Our ML problem is a three-
class problem, where we classified a patient according to the medication
consumption. So, a patient can be classified under class A, class B or both. We used
the 9 features selected from different algorithms (see Table 2) along with the other 6
features (see Table 1) to train our ML models. Therefore, all the ML models were
trained with 15 features in total for classifying the patients into three classes.
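The evaluation protocol can be sketched as a 70/30 split followed by an accuracy calculation. The paper does not state how the partition was drawn, so the seeded random shuffle without stratification below is an assumption, as are the function names and toy data.

```python
import random


def train_test_split(rows, labels, test_fraction=0.3, seed=0):
    """Shuffle indices with a fixed seed and hold out test_fraction for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(idx, seq):
        return [seq[i] for i in idx]

    return (pick(train_idx, rows), pick(test_idx, rows),
            pick(train_idx, labels), pick(test_idx, labels))


def accuracy(predicted, actual):
    """Percentage of predictions that match the actual class."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)


# Hypothetical toy data with the three classes A, B, and Both.
rows = list(range(10))
labels = ["A"] * 5 + ["B"] * 4 + ["Both"]
X_train, X_test, y_train, y_test = train_test_split(rows, labels)
```

With a class distribution as skewed as the one in this dataset, a stratified split that preserves the class proportions in both partitions would be a reasonable alternative to the plain shuffle shown here.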
4 Results
We applied various ML algorithms like Naïve Bayes, decision tree, logistic
regression, support vector machine (linear kernel), and support vector machine (radial
kernel) on our dataset and compared their classification accuracy. We used 6 different
features selection approaches in this paper. Fig. 1 shows the classification accuracy on
training data from different ML algorithms for the three-class classification problem.
Fig. 2 shows the classification accuracy on test data from different ML algorithms for
the three-class classification problem. The x-axis in Fig. 1 and Fig. 2 shows the
different features selection techniques and the y-axis shows the accuracy as a
percentage. We found that all the ML algorithms gave the best accuracy on test data
when their features were selected using the LASSO features selection approach. On
test data, the best accuracy of 59.04% was achieved from logistic regression with
features selected using LASSO. The second best accuracy of 58.56% was achieved
from SVM (radial kernel) with features selected using LASSO. The third best
accuracy of 57.5% was achieved from SVM (radial kernel) with features selected
using PCA.
Furthermore, on test data, the best accuracy with features selected using
information gain and CFS was 56.99% from the SVM (radial kernel) algorithm.
Similarly, on test data, the best accuracy with features selected using Apriori and
ridge regression was 56.97% from the SVM (radial kernel) algorithm.
*We obtained the same features from the information gain and CFS techniques.
Fig. 1. The classification accuracy on training data from different ML algorithms. [x-axis: algorithms for features selection; y-axis: accuracy (%); series: Naïve Bayes, logistic regression, decision tree, SVM (linear kernel), SVM (radial kernel)]
Fig. 2. The classification accuracy on test data from different ML algorithms. [x-axis: algorithms for features selection; y-axis: accuracy (%); series: Naïve Bayes, logistic regression, decision tree, SVM (linear kernel), SVM (radial kernel)]
5 Discussion and Conclusions
Medical datasets contain multiple patient-related features. Most of these features are
the diagnoses or procedures that the patient has undergone throughout their treatment
[15]. Several of these features could be inter-related or interdependent and can
influence the medication that the patient consumes. However, in order to classify patients
according to the medicine they consume, we need to first select the right subset of
these diagnoses and procedures (features). There are various state-of-the-art features
selection techniques available in the literature [5-6]. Each of these techniques follows
a different mechanism to select the relevant features in the dataset. Furthermore,
researchers have also checked the potential of Apriori algorithm [15] to select the
frequently appearing diagnoses and procedures in the medical dataset. In this paper,
our primary objective is to compare the PCA, information gain, correlation
coefficients score, LASSO, and ridge regression with the Apriori algorithm to select
the relevant features before applying machine learning algorithms for classifying the
patients according to the type of medication they consume, i.e. medicine A, medicine
B, or both. There were 15,075 diagnoses and procedures for about 45,000 patients in
the dataset. We selected the top 9 most relevant features from all the features selection
techniques. After combining these 9 (selected) diagnoses and procedures with 6 other
demographic and clinical features (15 in total), we applied naïve Bayes, decision tree,
logistic regression, SVM (linear kernel), and SVM (radial kernel) to classify the patients.
First, we found that all the ML algorithms had the highest accuracy when we used
the LASSO method for feature selection. This result is likely because LASSO is an L1
regularization and regression technique, which penalizes a model for having
too many variables [21]. The consequence of imposing this penalty is to
shrink the coefficient values towards zero, so that the less contributing variables
have a coefficient close to or equal to zero. Therefore, LASSO selects only the
relevant features that contribute the most towards predicting the class
variable, which could explain why the performance of the classifiers
improved with the features selected by LASSO.
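The shrinkage behaviour described above can be illustrated with a minimal sketch of LASSO solved by cyclic coordinate descent. This is not the implementation used in this paper; the synthetic data, the penalty strength lam, and the function names soft_threshold and lasso_coordinate_descent are all assumptions made for illustration. The sketch shows how the L1 penalty sets the coefficients of uninformative features exactly to zero, so the surviving non-zero coefficients act as the selected features.

```python
import random


def soft_threshold(rho, lam):
    """Shrink rho towards zero by lam; values inside [-lam, lam] become exactly 0."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iters=200):
    """Minimise (1/2n) * ||y - Xw||^2 + lam * ||w||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw while w is all zeros
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iters):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # a constant-zero column carries no information
            # Correlation of feature j with the residual that excludes feature j.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * w[j]) for i in range(n)) / n
            new_w = soft_threshold(rho, lam) / col_sq[j]
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                w[j] = new_w
    return w


# Hypothetical synthetic data: only features 0 and 2 influence the target.
rng = random.Random(0)
n_samples = 200
X = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(n_samples)]
y = [3.0 * row[0] - 2.0 * row[2] + rng.gauss(0, 0.1) for row in X]

w = lasso_coordinate_descent(X, y, lam=0.5)
selected = [j for j, wj in enumerate(w) if wj != 0.0]
print("coefficients:", [round(wj, 3) for wj in w])
print("selected feature indices:", selected)
```

In this sketch, a larger lam removes more features and a smaller lam keeps more; selecting a fixed number of features, such as the top 9 used here, would correspond to ranking the features by the magnitude of their coefficients.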
Second, we found that the SVM (radial kernel) gave the highest accuracy (on test
data) when it used features selected by the Apriori, PCA, ridge regression, information
gain, and correlation coefficients score based techniques. Only in the case of LASSO did
logistic regression (59.02% classification accuracy on test data) perform better than
SVM (radial kernel; 58.56% classification accuracy on test data). However, the
difference in their classification accuracy is marginal. One possible reason for the
strong performance of SVM (radial kernel) could be that it performs well when the
nature of the data is non-linear.
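As a minimal illustration of why a radial kernel can capture non-linear structure, the sketch below contrasts a linear kernel with a radial basis function (RBF) kernel. The gamma value and the function names are assumptions made for illustration; the kernel parameters used in our experiments are not restated here.

```python
import math


def linear_kernel(u, v):
    """Dot product: similarity grows linearly with the feature values."""
    return sum(a * b for a, b in zip(u, v))


def rbf_kernel(u, v, gamma=1.0):
    """exp(-gamma * ||u - v||^2): similarity depends only on the distance between points."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)


# Two hypothetical patients encoded as binary diagnosis/procedure indicators.
patient_a = [1, 0, 1, 0]
patient_b = [1, 1, 0, 0]

print("linear:", linear_kernel(patient_a, patient_b))
print("rbf:", rbf_kernel(patient_a, patient_b, gamma=0.5))
```

Because the RBF similarity decays with distance rather than growing with the dot product, a classifier built on it can form curved decision boundaries around clusters of similar patients, which a linear kernel cannot do.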
Furthermore, prior research has compared the Apriori approach of features
selection only with the case when all features are present [15]; Apriori was not
compared with other feature selection approaches. In this paper, we performed this
comparison and found that Apriori performed similarly to the ridge regression,
information gain, and CFS approaches. However, the LASSO approach performed better
than the Apriori approach on this dataset. From our findings, we conclude that it is good
practice to perform features selection before applying machine learning. Furthermore,
LASSO may be used as a feature selection approach in datasets that contain
thousands of inter-related features.
In this paper, we compared the traditional features selection approaches on a
healthcare dataset involving several attributes. However, recent literature on deep
learning has revealed the effectiveness of using different forms of autoencoders for
feature selection [26]. Thus, as part of our future work, we plan to extend our
investigation by applying different forms of autoencoders to this dataset. These ideas
form the immediate next steps in our machine-learning research program in the
healthcare domain.
Acknowledgment. The project was supported by grants (awards:
#IITM/CONS/PPLP/VD/03 and #IITM/CONS/RxDSI/VD/16) to Varun Dutt.
References
1. Bhardwaj, R., Nambiar, A. R., & Dutta, D.: A Study of Machine Learning in Healthcare.
In Computer Software and Applications Conference (COMPSAC), IEEE 41st Annual,
Vol. 2, pp. 236-241 (2017).
2. Oswal, S., Shah, G., and Student, P. G.: A Study on Data Mining Techniques on
Healthcare Issues and its uses and Application on Health Sector. International Journal of
Engineering Science. 13536 (2017).
3. Sharma, A., & Mansotra, V.: Emerging applications of data mining for healthcare
management-a critical review. In Computing for Sustainable Global Development
(INDIACom), IEEE International Conference, pp. 377-382. (2014).
4. Parikh, R. B., Obermeyer, Z., and Bates, D. W.: Making Predictive Analytics a
Routine Part of Patient Care (2016).
Accessed 5 January 2018.
5. Song, F., Guo, Z. and Mei, D.: Feature selection using principal component analysis.
In System science, engineering design and manufacturing informatization (ICSEM),
international conference on IEEE, Vol. 1, pp. 27-30 (2010).
6. Jović, A., Brkić, K., & Bogunović, N.: A review of feature selection methods with
applications. In Information and Communication Technology, Electronics and
Microelectronics (MIPRO), IEEE 38th International Convention, pp. 1200-1205 (2015).
7. Agrawal, R., and Srikant, R.: Fast algorithms for mining association rules. In Proc. 20th
int. conf. very large data bases, VLDB. Vol. 1215, pp. 487-499 (1994).
8. Sharma, R., Singh, S.N. and Khatri, S.: Medical data mining using different classification
and clustering techniques: a critical survey. In Computational Intelligence &
Communication Technology (CICT), Second International Conference on IEEE. pp. 687-
691 (2016).
9. Abdullah, U., Ahmad, J., & Ahmed, A.: Analysis of effectiveness of apriori algorithm in
medical billing data mining. In Emerging Technologies, ICET, 4th International
Conference on IEEE, pp. 327-331 (2008).
10. Ilayaraja, M. and Meyyappan, T.: Efficient Data Mining Method to Predict the Risk of
Heart Diseases through Frequent Itemsets. Procedia Computer Science, 70, pp.586-592
11. Stilou, S., Bamidis, P. D., Maglaveras, N., & Pappas, C.: Mining association rules from
clinical databases: an intelligent diagnostic process in healthcare. Studies in health
technology and informatics, (2), pp. 1399-1403 (2001).
12. Kaushik, S., Choudhury A., Mallik K., Moid A., and Dutt V.: Applying Data Mining to
Healthcare: A Study of Social Network of Physicians and Patient Journeys. In Machine
Learning and Data Mining in Pattern Recognition, pp. 599-613. Springer International
Publishing, New York (2016).
13. Janecek, A., Gansterer, W., Demel, M. and Ecker, G.: On the relationship between feature
selection and classification accuracy. In New Challenges for Feature Selection in Data
Mining and Knowledge Discovery, pp. 90-105 (2008).
14. Motoda, H. and Liu, H.: Feature selection, extraction and construction. Communication of
IICM (Institute of Information and Computing Machinery, Taiwan) Vol, 5, pp.67-72
15. Kaushik, S., Choudhury, A., Dasgupta, N., Natarajan, S., Pickett, L. A., & Dutt, V.:
Evaluating Frequent-Set Mining Approaches in Machine-Learning Problems with Several
Attributes: A Case Study in Healthcare. In International Conference on Machine Learning
and Data Mining in Pattern Recognition, pp. 244-258. Springer, Cham (2018).
16. Liu, C., Wang, W., Zhao, Q., Shen, X., & Konan, M.: A new feature selection method
based on a validity index of feature subset. Pattern Recognition Letters, 92, pp. 1-8,
17. Jain, D., & Singh, V.: Feature selection and classification systems for chronic disease
prediction: A review. Egyptian Informatics Journal (2018).
18. Harb, H. M., & Desuky, A. S.: Feature selection on classification of medical datasets
based on particle swarm optimization. International Journal of Computer Applications,
104(5) (2014).
19. Lee, I. H., Lushington, G. H., & Visvanathan, M.: A filter-based feature selection
approach for identifying potential biomarkers for lung cancer. Journal of clinical
Bioinformatics, 1(1), 11, (2011).
20. Hall, M. A.: Correlation-based feature selection for machine learning (1999).
21. Fonti, V., & Belitser, E.: Feature selection using lasso. VU Amsterdam Research Paper in
Business Analytics (2017).
22. Quinlan, J.R.: Induction of decision trees. Machine learning, 1(1), pp.81-106 (1986).
23. Langley, P. and Sage, S.: Induction of selective Bayesian classifiers. In Proceedings of the
Tenth international conference on Uncertainty in artificial intelligence. Morgan Kaufmann
Publishers Inc., pp. 399-406 (1994).
24. Peng, C.Y.J., Lee, K.L. and Ingersoll, G.M.: An introduction to logistic regression analysis
and reporting. The journal of educational research, 96(1), pp.3-14 (2002).
25. Hearst, M.A., Dumais, S.T., Osuna, E., Platt, J. and Scholkopf, B.: Support vector
machines. IEEE Intelligent Systems and their applications, 13(4), pp.18-28 (1998).
26. Guo, X., Minai, A. A., & Lu, L. J.: Feature selection using multiple auto-encoders. IEEE
International Joint Conference on Neural Networks (IJCNN) pp. 4602-4609 (2017).
27. Kamkar, I., Gupta, S. K., Phung, D., & Venkatesh, S.: Stable feature selection for clinical
prediction: Exploiting ICD tree structure using Tree-Lasso. Journal of biomedical
informatics, 53, pp.277-290 (2015).
28. Hoque, N., Bhattacharyya, D. K., & Kalita, J. K.: MIFS-ND: A mutual information-based
feature selection method. Expert Systems with Applications, 41(14), pp.6371-6385 (2014).