Comparative Analysis of Features Selection Techniques
for Classification in Healthcare
Shruti Kaushik1,a, Abhinav Choudhury1,b, Ashutosh Kumar Jatav2,c, Nataraj
Dasgupta3,d, Sayee Natarajan3,e, Larry A. Pickett3,f, and Varun Dutt1,g
1Applied Cognitive Science Laboratory, Indian Institute of Technology Mandi, Himachal
Pradesh, India – 175005
2Indian Institute of Technology Jodhpur, Rajasthan, India - 342037
3RxDataScience, Inc., USA - 27709
a shruti_kaushik@students.iitmandi.ac.in, b abhinav_choudhury@students.iitmandi.ac.in,
c jatav.1@iitj.ac.in, d nd@rxdatascience.com, e sayee@rxdatascience.com,
f larry@rxdatascience.com, and g varun@iitmandi.ac.in
Abstract. Analyzing high-dimensional data is a major challenge in the field of
data mining. Features selection is an effective way to remove irrelevant
information from the data. Prior research has utilized the Apriori frequent-set
mining approach to discover the relevant and interrelated features in the health
domain. However, the comparison of the Apriori algorithm with other features
selection approaches is absent in the literature. This paper aims to compare the
state-of-the-art features selection techniques with Apriori in the presence of
thousands of features in a healthcare dataset. After the features are selected we
perform a three-class classification using a number of machine-learning
algorithms where patients are classified according to the pain medication they
consume. Results revealed that among LASSO, ridge regression, PCA,
information gain, Apriori, and correlation-based features selection techniques,
LASSO followed by classification gave the highest accuracy. We highlight the
implications of using feature-selection algorithms before classification in
healthcare datasets.
Keywords: Features selection, Dimensionality reduction, Machine Learning, Healthcare,
High dimensional dataset.
1 Introduction
In the past few years, there have been significant developments in how machine
learning (ML) can be used in various industries and research [1]. In recent years, ML
algorithms have also been utilized in the healthcare sector [2]. The existence of
electronic health records (EHRs) has allowed researchers to apply ML algorithms for
public health management, diagnoses, disease treatments, and for analyzing patients’
medical history [3, 12]. Mining hidden patterns in healthcare datasets could help
healthcare providers and pharmaceutical companies plan quality healthcare for
patients in need in a timely manner [12].
In general, healthcare datasets are large, and they may contain several thousands of
attributes [15]. To predict healthcare outcomes accurately, ML algorithms need to
focus on selecting relevant features from data [4]. However, the presence of
thousands of attributes in data may become problematic for classification algorithms,
as processing these attributes may involve large memory usage and high
computational costs [13]. Generally, two feature-engineering techniques
have been suggested in the literature to address the problem of datasets possessing a
large number of features: feature reduction (dimensionality reduction) and feature
selection [14]. Feature-reduction techniques reduce the number of attributes by
creating new combinations of attributes, whereas feature-selection techniques include
and exclude attributes present in the data without changing them [14]. In the past,
algorithms like principal component analysis (PCA) have been used for feature
reduction [5]. Also, filter methods (e.g., information gain, correlation coefficients
score), wrapper methods (e.g., recursive features elimination), and embedded methods
(e.g., LASSO, ridge regression) have been used for features selection in the past [6].
Another way for feature selection is by using frequent item-set mining algorithms
(e.g., Apriori algorithm) [15]. Apriori selects a subset of features by looking at the
associations among items while selecting frequent item-sets [7].
Kaushik et al. [15] compared the classification accuracies of various algorithms
when all features were present and when only selected features from Apriori
algorithm were present in classification models. Classification accuracies were higher
in data after Apriori algorithm compared to data where all features were present [15].
Although there has been prior research comparing a number of features selection
techniques [6], a comparison of these techniques with the
Apriori algorithm has been less explored. In this paper, we address this gap in the
literature and compare the performance of the Apriori algorithm for feature selection
with various other feature selection algorithms like PCA, information gain,
correlation coefficients score, LASSO, and ridge regression.
Specifically, we take a healthcare dataset involving consumption of two pain
medications in the US, and we apply different feature selection techniques including
the Apriori algorithm for feature selection and subsequent classification. Most of the
attributes in our dataset are the unique diagnoses and procedures associated with
patients. These diagnoses and procedures make the data sparse, as there may be
variability among patients in terms of the diagnoses and procedures associated
with each patient. Also, diagnoses and procedures might be inter-related in the dataset,
which may help certain feature-selection algorithms compared to the Apriori
algorithm. Furthermore, to get confidence in our results, we try to solve a 3-class
classification task by applying several ML algorithms such as the decision tree [22],
Naïve Bayes classifier [23], logistic regression [24], and support vector machine
(SVM) [25] on the selected features.
In what follows, we first provide a brief review of related literature. Next, we
explain the methodology of applying various features selection techniques. In Section
4, we present our experimental results and compare classification accuracies after
selecting features from different features selection techniques. Finally, we discuss our
results and conclude our paper by highlighting the main implications of this research
and its future scope.
2 Background
Researchers have applied data mining algorithms to mine the hidden knowledge in the
healthcare domain [2, 8, 12]. For example, various data-mining approaches such as
clustering, classification, and association-rule mining have made a significant
contribution to medical research [8, 12].
Prior research has utilized information gain and certain hybrid approaches for
features selection and developed classification systems for predicting chronic diseases
Harb and Desuky have used a correlation-based features selection approach to
improve the classification of medical datasets [18]. Kamkar et al. have used the
LASSO approach for features selection and compared it with information gain to
build clinical prediction models [27]. Also, Fonti and Belitser have compared LASSO
and ridge regression to perform features selection on high dimensional datasets [21].
Moreover, researchers have also evaluated PCA for features selection in image
recognition tasks [5]. These researchers have used various ML techniques like J48,
Naïve Bayes, SVM, Bayesian Networks, neural networks, and K-nearest neighbor
approach for performing classification on the selected features set [15-18].
Furthermore, researchers have used frequent-set mining algorithms like Apriori to
find the associations between diagnosis and treatment [9]. Besides, certain researchers
have focused on identifying frequent diseases using Apriori [10], and others have
discovered frequently appearing diagnoses and procedures for consumers of certain
pain medications [15]. Some researchers have used Apriori to reveal the unexpected
patterns in diabetic data repositories [11] and to predict the risks of heart diseases
[10].
However, to the best of the authors’ knowledge, prior research has yet to compare
the Apriori algorithm (for selecting features) with the other well-known features
selection techniques for predicting healthcare outcomes. In this paper, we address
this literature gap and perform features selection using the Apriori frequent-set mining
algorithm, information gain, correlation coefficient score, LASSO, ridge regression,
and PCA for classifying patients based on the pain medications they consumed. To
compare these approaches, we take a specific healthcare dataset where there are
potentially thousands of features to choose between, and we investigate the
classification accuracies of different ML algorithms using the selected feature sets.
2.1 Feature Selection Methods
Feature selection methods fall into three main categories: wrapper, filter, and
embedded [17]. Wrapper methods assess subsets of variables according to their
usefulness to a given predictor. Filter methods use the general characteristics of the data
itself and work separately from the learning algorithm. Filter methods use the
statistical correlation between a set of features and the target feature; the amount of
correlation between a feature and the target variable determines the importance of that
feature [17]. Filter-based approaches do not depend on classifiers, are usually
faster and more scalable, and have low computational complexity. Examples of
filter methods include information gain, correlation coefficient, and the chi-square test.
Moreover, some learning algorithms perform features selection as part of their
overall operation. These include regularization techniques such as LASSO (L1
regularization) and ridge regression (L2 regularization). These two techniques are part
of the embedded approach for features selection.
Apart from these methods, PCA as a feature reduction technique could be used to
find relevant features in data [5]. PCA tries to find the direction of most variations in
data and then transforms the original attributes space to the new space of maximum
variance.
In this paper, we used the following features selection techniques: two filter
methods (information gain and correlation coefficients score), two embedded methods
(LASSO and ridge regression), one features reduction technique (PCA), and one
frequent item-set mining approach (Apriori) for finding the relevant features. We
investigated the classification accuracy of different ML algorithms using the features
selected from the approaches mentioned above.
2.1.1 Information Gain
Information gain (IG) can detect the features possessing the maximum information
about a specific class [19]. It is one of the most important feature-ranking methods,
and it measures the dependency between a feature and a class label. The IG of a
feature $X$ and class $Y$ is calculated as:

$IG(X, Y) = H(Y) - H(Y \mid X)$    (1)

where $H(\cdot)$ is the entropy, a measure of the uncertainty of a random variable. If
we assume that we have a two-class classification problem (y = 0 and y = 1 are the
class labels), then $H(Y)$ and $H(Y \mid X)$ are defined as:

$H(Y) = -P(y=0)\log_{2}P(y=0) - P(y=1)\log_{2}P(y=1)$    (2)

$H(Y \mid X) = \sum_{x} P(X = x)\, H(Y \mid X = x)$    (3)

In this technique, we calculate the information gain of each feature independently,
and the top k features are selected as the final feature set.
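To make the ranking step concrete, the following is a minimal sketch (not the authors' implementation) that uses scikit-learn's mutual_info_classif as an information-gain-style score and keeps the top k features; the feature matrix X, labels y, and k = 9 are illustrative placeholders.

```python
# Minimal sketch: rank features by an information-gain-style score and keep the
# top k. X, y, and k are placeholders, not the paper's actual data.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def top_k_by_information_gain(X, y, k=9):
    # Estimate the mutual information of every feature with the class label.
    ig = mutual_info_classif(X, y, discrete_features=True, random_state=0)
    # Indices of the k features with the highest score, best first.
    return np.argsort(ig)[::-1][:k]

# Toy example with binary features and a three-class label.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 20))
y = rng.integers(0, 3, size=100)
print(top_k_by_information_gain(X, y))
```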
2.1.2 Correlation-based Feature Selection (CFS)
CFS is a simple filter algorithm that ranks feature subsets and scores the merit of a
feature or a subset of features according to a correlation-based heuristic evaluation
function [20]. The purpose of CFS is to find subsets that contain features that are
highly correlated with the class and uncorrelated with each other. Redundant
features are excluded, as they will be highly correlated with one or more of the
remaining features. The acceptance of a feature will depend on the extent to which it
predicts classes in areas of the instance space not already predicted by other features
[20]. CFS’s feature-subset evaluation function is shown as follows:

$Merit_S = \dfrac{k\,\overline{r_{cf}}}{\sqrt{k + k(k-1)\,\overline{r_{ff}}}}$    (4)

where $Merit_S$ is the heuristic “merit” of a feature subset S containing k features,
$\overline{r_{cf}}$ is the mean feature-class correlation (f ∈ S), and $\overline{r_{ff}}$ is the average feature-feature
inter-correlation. This equation is, in fact, Pearson’s correlation, where all
variables have been standardized. The numerator can be thought of as indicating how
predictive of the class a group of features is; the denominator is an indication of how
much redundancy there is among them (features).
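The merit score in Eq. (4) can be computed directly with NumPy. The sketch below is an assumed, simplified version that uses absolute Pearson correlations for the feature-class and feature-feature terms; the data and variable names are placeholders rather than the paper's dataset.

```python
# Minimal sketch of the CFS merit score from Eq. (4):
# Merit_S = k * r_cf / sqrt(k + k(k-1) * r_ff). Toy data only.
import numpy as np

def cfs_merit(X_subset, y):
    k = X_subset.shape[1]
    # Mean absolute feature-class correlation (r_cf).
    r_cf = np.mean([abs(np.corrcoef(X_subset[:, j], y)[0, 1]) for j in range(k)])
    # Mean absolute feature-feature inter-correlation (r_ff).
    pairs = [abs(np.corrcoef(X_subset[:, i], X_subset[:, j])[0, 1])
             for i in range(k) for j in range(i + 1, k)]
    r_ff = np.mean(pairs) if pairs else 0.0
    return (k * r_cf) / np.sqrt(k + k * (k - 1) * r_ff)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
print(cfs_merit(X, y))
```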
2.1.3 Least Absolute Shrinkage and Selection Operator (LASSO)
LASSO is a powerful method that performs mainly two tasks: regularization and
feature selection [21]. The LASSO method puts a constraint on the sum of the
absolute values of the model parameters; the sum has to be less than a fixed value
(upper bound). The method applies a shrinking (L1 regularization) process where it
penalizes the coefficients of the regression variables shrinking some of them to zero.
During the features selection process, the variables that still have a non-zero coefficient
after the shrinking process are selected to be part of the model. The goal of this
process is to minimize the prediction error. In practice, the tuning parameter λ, which
controls the strength of the penalty, assumes great importance. When λ is
sufficiently large, some coefficients are forced to be exactly zero, and dimensionality
is thereby reduced: the larger the parameter λ, the more coefficients are shrunk to
zero. On the other hand, if λ = 0, we have an OLS (ordinary least squares) regression.
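As an illustration of the selection step, a minimal sketch using scikit-learn's Lasso is shown below (an assumed implementation, not necessarily the one used in the paper); alpha plays the role of λ, and the synthetic data stands in for the real feature matrix.

```python
# Minimal sketch: fit an L1-penalized linear model and keep only the features
# whose coefficients remain non-zero after shrinkage. Synthetic data.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 50))
y = X[:, 0] - 2 * X[:, 3] + rng.normal(scale=0.1, size=300)

lasso = Lasso(alpha=0.05).fit(X, y)
selected = np.flatnonzero(lasso.coef_)   # features with non-zero coefficients survive
print("selected feature indices:", selected)
```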
2.1.4 Ridge Regression
Ridge regression works by penalizing the magnitude of coefficients of features along
with minimizing the error between predicted and actual observations [21]. This is a
regularization technique like LASSO. It performs L2 regularization, adding a
penalty equivalent to the sum of the squared magnitudes of the coefficients.
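A corresponding sketch for ridge regression, again assuming scikit-learn and synthetic data: since ridge coefficients are rarely shrunk exactly to zero, features are ranked by coefficient magnitude instead.

```python
# Minimal sketch: rank features by the magnitude of their ridge coefficients.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 50))
y = X[:, 0] - 2 * X[:, 3] + rng.normal(scale=0.1, size=300)

ridge = Ridge(alpha=1.0).fit(X, y)
ranking = np.argsort(np.abs(ridge.coef_))[::-1]   # largest |coefficient| first
print("top features by |coefficient|:", ranking[:9])
```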
2.1.5 Principal Component Analysis (PCA)
PCA does not directly select the features; it is a dimension reduction technique. PCA
aims to reduce the dimensionality of a dataset that contains a large number of
correlated attributes by transforming the original attributes space to a new space in
which attributes are uncorrelated [5]. The algorithm then ranks the transformed
attributes by the amount of variation they capture; the attributes with the most
variation are kept, and the rest of the attributes are discarded. It is also important to mention that
PCA is an unsupervised technique because it does not take into account the class
label.
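A minimal PCA sketch, assuming scikit-learn and synthetic data: the data is projected onto the leading components, and the explained-variance ratio reports how much variance those directions capture.

```python
# Minimal sketch: project the data onto its 9 directions of largest variance.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 40))

pca = PCA(n_components=9)
X_reduced = pca.fit_transform(X)                       # data in the new 9-D space
print("explained variance ratio:", pca.explained_variance_ratio_.sum())
```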
2.1.6 Apriori Algorithm
The Apriori algorithm [7] is used for finding the frequent item-sets in a transaction
database. It uses an iterative level-wise approach to generate the frequent item-sets.
This algorithm works in the following steps:
1. The transactions in database D are scanned to determine the frequent 1-itemsets,
$L_1$, that possess the minimum support, where the support of an itemset X is defined
as the proportion of the transactions in database D that contain the itemset X.
2. Candidate k-itemsets, $C_k$, are generated by joining two frequent (k-1)-itemsets,
$L_{k-1}$, and candidates with an infrequent subset are removed.
3. D is scanned to obtain the support count of each candidate k-itemset in $C_k$.
4. The set of frequent k-itemsets, $L_k$, is then determined from the support counts
of the candidates in $C_k$.
5. Steps 2-4 are repeated until no candidate (k+1)-itemsets, $C_{k+1}$, can be generated.
6. The frequent itemsets are extracted as $L = \bigcup_k L_k$.
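For illustration, a minimal sketch using the mlxtend implementation of Apriori is shown below (an assumption; the paper does not specify its implementation). Each row is a patient "transaction" of one-hot diagnosis/procedure flags, and the column names are hypothetical.

```python
# Minimal sketch: mine frequent item-sets from one-hot "transactions" with
# support >= min_support. Column names and data are illustrative only.
import pandas as pd
from mlxtend.frequent_patterns import apriori

transactions = pd.DataFrame(
    [[1, 1, 0, 1], [1, 0, 1, 1], [0, 1, 1, 1], [1, 1, 1, 0]],
    columns=["dx_knee_oa", "proc_knee_arthroplasty", "dx_back_pain", "proc_hip_repl"],
).astype(bool)

frequent = apriori(transactions, min_support=0.5, use_colnames=True)
print(frequent.sort_values("support", ascending=False))
```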
After selecting the relevant features from these features selection techniques, we
formed a classification problem. We applied the following ML algorithms to classify
our data: decision tree, Naïve Bayes, logistic regression, and support vector machine.
2.2 Machine-Learning Algorithms
2.2.1 Decision Tree
A decision tree is a classification algorithm that classifies instances by sorting
them down the tree from the root to a leaf node. Each node in the decision tree specifies
a test on an attribute of the instance, and each branch descending from the node
corresponds to one possible value of that attribute. The following assumptions are taken
into account while creating a decision tree [22]:
1. Initially, the complete set of training attributes is evaluated at the root node.
2. Categorical feature values are preferred to continuous ones. Continuous values
need to be discretized before building the model.
3. Attribute values are used to recursively distribute the records.
4. Entropy and gain are calculated for each attribute to decide its placement
within the decision tree.
The main challenge in building a decision tree is deciding which attribute to test at each
node in the tree. Random selection of attributes for nodes leads to very low accuracy
[22]. We have used the information-gain measure to identify the attribute to be
considered as the root node at each level.
Information Gain: Information gain is based on the concept of entropy from
information theory. We assume attributes to be categorical while using information
gain as an attribute-selection criterion. Entropy is defined as [22]:

$E = -\sum_{i} p_i \log_{2} p_i$    (5)

where the $p_i$ are fractions that add up to 1 and represent the percentage of
each class present in the child node that results from a split in the tree [22].
Furthermore, the information gain of splitting on an attribute $a$ is defined as:

$IG(a) = \text{Entropy(parent)} - \text{Weighted Sum of Entropy(children)}$    (6)

Information gain (IG) thus calculates the expected reduction in entropy due to sorting
on the attribute. At any node, the attribute with the maximum value of information
gain is preferred over the other attributes.
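A small sketch of Eqs. (5)-(6), assuming NumPy and toy categorical data: it computes node entropy and the information gain of splitting on one attribute, which is the quantity used to pick the attribute at each node.

```python
# Minimal sketch of Eqs. (5)-(6): node entropy and the information gain of
# splitting on a categorical attribute. Labels and attribute values are toy data.
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, attribute_values):
    parent = entropy(labels)
    children = 0.0
    for v in np.unique(attribute_values):
        mask = attribute_values == v
        children += mask.mean() * entropy(labels[mask])   # weighted child entropy
    return parent - children

y = np.array(["A", "A", "B", "Both", "B", "A"])
attr = np.array(["surgical", "surgical", "medical", "medical", "medical", "surgical"])
print(information_gain(y, attr))
```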
2.2.2 Naïve Bayes
Naïve Bayes is a probabilistic classifier based on Bayes’ theorem. It is called
naïve because it makes a strong independence assumption between features [23]. It
assumes that the value of a particular feature is independent of the value of any other
feature, given the class variable. Despite this assumption, Naïve Bayes has been quite
successful in solving practical problems in text classification, medical diagnosis and
system performance management [23]. The classifier attempts to maximize the
posterior probability in determining the class of a transaction.
Suppose the vector $x = (x_1, x_2, \ldots, x_n)$ represents the features in the problem, with n
denoting the total number of features, and let $C_k$ denote one of the K possible classes.
Naïve Bayes is a conditional probability model which can be decomposed as [23]:

$P(C_k \mid x_1, \ldots, x_n) = \dfrac{P(C_k)\, P(x_1, \ldots, x_n \mid C_k)}{P(x_1, \ldots, x_n)}$    (7)

Under the independence assumption, the probabilities of the attributes are defined as
follows [23]:

$P(x_1, \ldots, x_n \mid C_k) = \prod_{i=1}^{n} P(x_i \mid C_k)$    (8)

The most probable class is then picked based on the maximum a posteriori (MAP)
decision rule [23] as follows:

$\hat{y} = \underset{k \in \{1, \ldots, K\}}{\arg\max}\; P(C_k) \prod_{i=1}^{n} P(x_i \mid C_k)$    (9)
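A minimal sketch of the MAP rule in practice, assuming scikit-learn's BernoulliNB (chosen here because the diagnosis/procedure features are binary flags; the paper does not state which variant it used) and synthetic data.

```python
# Minimal sketch: a naive Bayes classifier picks the class with the maximum
# posterior probability (Eq. 9). Data and class labels are placeholders.
import numpy as np
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 15))           # 15 binary input features
y = rng.choice(["A", "B", "Both"], size=200)     # three medication classes

nb = BernoulliNB().fit(X, y)
print(nb.predict(X[:5]))                          # MAP class per patient
print(nb.predict_proba(X[:5]))                    # posterior probabilities
```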
2.2.3 Logistic Regression
Logistic regression is a linear classifier that is used to model the relationship
between a dependent binary variable and one or more independent variables [24]. It
models the posterior probabilities of the K classes for an instance. The
simple (single-predictor) logistic regression is defined as:

$y = \dfrac{e^{a_0 + a_1 x}}{1 + e^{a_0 + a_1 x}}$    (10)

where y is the predicted output, $a_0$ is the bias or intercept term, and $a_1$ is the
coefficient for the input value x.
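A short sketch of Eq. (10) alongside scikit-learn's LogisticRegression, which generalizes it to many features and to the three-class setting; the coefficients and data below are illustrative placeholders.

```python
# Minimal sketch: Eq. (10) for a single predictor, plus scikit-learn's
# LogisticRegression for the multi-feature, three-class case. Toy data only.
import numpy as np
from sklearn.linear_model import LogisticRegression

def simple_logistic(x, a0, a1):
    return np.exp(a0 + a1 * x) / (1.0 + np.exp(a0 + a1 * x))   # Eq. (10)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 15))
y = rng.choice([0, 1, 2], size=200)               # medicine A, B, or both

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(simple_logistic(0.5, a0=-1.0, a1=2.0), clf.predict(X[:3]))
```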
2.2.4 Support Vector Machines
Support vector machines (SVMs) are supervised classification techniques that are
accurate and robust even for small training samples [25]. Furthermore, they have the
ability to handle large feature spaces. SVMs are binary classifiers that can
also be used for multi-class classification tasks. They build a hyperplane or a set of
hyperplanes in a high-dimensional space, which can be used for classification and
regression tasks. SVMs can classify linearly as well as non-linearly separable
data [25]. If the data is linearly separable, then the SVM uses a linear hyperplane to
perform classification. However, for non-linear data, rather than fitting a non-linear
curve, it transforms the data into a high-dimensional space in which classification is
performed. The SVM uses kernel functions, e.g., the radial basis function (RBF)
kernel, to transform the data into this high-dimensional space for classifying
non-linear data [25]. For better classification, the support weights are optimized to
minimize the objective (error) function.
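A minimal sketch comparing a linear-kernel and an RBF-kernel SVM, assuming scikit-learn's SVC and synthetic three-class data; the hyperparameters are library defaults, not the tuned values used in the paper.

```python
# Minimal sketch: linear-kernel vs RBF-kernel SVM on synthetic three-class data.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 15))
y = rng.choice([0, 1, 2], size=300)

svm_linear = SVC(kernel="linear").fit(X, y)
svm_rbf = SVC(kernel="rbf", gamma="scale").fit(X, y)   # non-linear mapping via RBF kernel
print(svm_linear.score(X, y), svm_rbf.score(X, y))
```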
3 Method
3.1 Data
In this paper, we used the Truven MarketScan® health dataset containing patients’
insurance claims in the US [15]. The dataset contains approximately 45,000 patients
who consumed two pain medications, medicine A, medicine B, or both, between
January 2011 and December 2015.1 The dataset contains patients’ demographic
variables (age, gender, region, and birth year), clinical variables (admission type,
diagnoses made, and procedures performed), the name of medicines, and medicines’
refill counts per patient. The dataset contains 55.20% records of patients who
consumed medicine A only, 39.98% records of medicine B only, and 4.82% records
for those patients who consumed both these medications. In total, 15,081 attributes
were present for each patient in this dataset, of which 15,075 attributes were
diagnoses and procedure codes, some of which were inter-related. We applied the
features selection algorithms (information gain, correlation coefficients score,
LASSO, ridge regression, PCA, and Apriori) on the 15,075 diagnoses and procedure
codes to select the relevant features and then combined the selected features with
the other 6 independent features. Table 1 shows the list of 6 features that were
used along with the features selected by the different features selection techniques in
the different ML algorithms. The ML algorithms classified patients according to the
medications consumed by them, i.e., medicine A, medicine B, or both.
Table 1. Description of Input Features for Classification Problem

Feature              Description
Sex                  Male, Female
Age group            0-17, 18-34, 35-44, 45-54, 55-64
Region               Northeast, northcentral, south, west, unknown
Type of admission    Surgical, medical, maternity and newborn, psych and substance abuse, unknown
Refill count         Count in number
Pain medication      A, B, Both
1 Due to a non-disclosure agreement, we have anonymized the actual names of these
medications.
3.2 Model Calibration
3.2.1 Features Selection
First, we performed features selection using the Apriori algorithm, which finds the
frequently appearing items in a dataset. With 3% support, the Apriori algorithm found
9 frequently appearing diagnoses and procedures out of the 15,075 diagnoses and
procedures. The 3% support was chosen after a sensitivity analysis in which the
male-female ratio of the frequently appearing diagnoses and procedures was
checked [15]. In order to compare the other features selection
techniques with the Apriori method, we selected the top 9 features from information
gain, CFS, LASSO, ridge regression, and PCA as well.
3.2.2 Information Gain
We calculated information gain for each feature for the output variable. Information
gain values vary from 0 (no information) to 1 (maximum information). Those features
that contribute more information will have a higher information gain value and can be
selected, whereas those that do not add much information will have a lower score and
can be removed. Furthermore, using the ranker search method [28], we obtained a
ranked list of the top 9 attributes. A search method is the technique by which we try to
navigate different combinations of attributes in the dataset in order to arrive at a short
list of chosen features.
3.2.3 CFS
CFS calculates the Pearson correlation between each feature and the output variable
(class) and selects only those features that have a moderate-to-high positive or
negative correlation (close to -1 or 1) and drops those features with a low correlation (a
value close to zero). Similar to the information gain method, we used the ranker
search [28] approach to obtain a list of top 9 attributes.
3.2.4 LASSO
LASSO is a regularization and features selection method. As described in the section
above, the parameter λ controls the strength of the penalty [21]. The larger the value
of λ, the greater the shrinkage. We adjusted the value of λ in such a way that we obtained
exactly the 9 most relevant attributes out of the 15,075 diagnoses and procedures. The value
of λ in our paper is 0.0027.
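A sketch of this calibration step, assuming scikit-learn's Lasso and synthetic data: sweep the penalty over a grid and return the first value that leaves exactly k non-zero coefficients. The grid and the helper name are hypothetical; the paper reports λ = 0.0027 for its dataset.

```python
# Minimal sketch: search for a LASSO penalty that keeps exactly k features.
import numpy as np
from sklearn.linear_model import Lasso

def lasso_alpha_for_k_features(X, y, k=9, alphas=np.logspace(-4, 0, 200)):
    for alpha in alphas:                              # small -> large penalty
        coefs = Lasso(alpha=alpha, max_iter=10000).fit(X, y).coef_
        if np.count_nonzero(coefs) == k:
            return alpha
    return None                                       # no alpha gave exactly k features

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 100))
y = X[:, :12] @ rng.normal(size=12) + rng.normal(scale=0.1, size=300)
print(lasso_alpha_for_k_features(X, y))
```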
3.2.5 Ridge Regression
Ridge regression is similar to LASSO; however, for the same value of λ,
the coefficients do not become exactly zero in ridge regression [21]. Therefore, we
ranked the attributes (based on the magnitude of their coefficients) given by ridge
regression for λ = 0.0027 and selected the top 9 attributes from them.
3.2.6 PCA
As explained above, PCA reduces dimensionality using the original feature set; it
does not select features the way the other techniques discussed in this paper do. As its
name says, PCA finds the principal components of the data. Principal components are
the directions in which the data is most spread out, i.e., the directions with the most
variance. Implementing PCA amounts to finding the eigenvalues and eigenvectors of the
data’s correlation matrix [5]. Eigenvectors and eigenvalues exist in pairs: every
eigenvector has a corresponding eigenvalue. The eigenvector gives the direction, and the
corresponding eigenvalue (a number) tells how much variance there is in
the data along that direction. In this paper, we selected the 9 principal components (9
directions or eigenvectors) with the 9 highest eigenvalues. This means that we
transformed our data along the directions of these 9 principal components. These 9
directions covered 15.85% of the total variance of the data.
Table 2 shows the list of features selected from different features selection
techniques.
Table 2. Description of Selected Features from Different Feature Selection Techniques

Apriori: Total knee arthroplasty; Osteoarthrosis secondary lower leg; Removal of foreign body from eye; Total knee replacement; Osteoarthrosis primary lower leg; Osteoarthrosis generalized lower leg; Total hip arthroplasty; Fasciolopsiasis; Total hip replacement.

Information Gain and CFS*: Blisters, epidermal loss; Third-degree perineal laceration; Traumatic amputation of arm and hand; Under cardiac catheterization; Arthropathy associated with other endocrine and metabolic disorders; Closed dislocation; Malignant neoplasm of bladder; Basal cell carcinoma of skin of other and unspecified parts of the face; Unspecified malignant neoplasm of skin.

LASSO: Chronic pain syndrome; Opioid type dependence, continuous; Other chronic pain; Opioid type dependence, unspecified; Backache; Diagnostic Radiology Procedures of Spine and Pelvis; Pneumonia, organism unspecified; Degeneration of lumbar or lumbosacral intervertebral disc; Tobacco use disorder.

Ridge Regression: Oxygen supplies rack; Closed fracture of base of skull without mention of intracranial injury, unspecified state of consciousness; Artery bypass graft; Facial nerve injury due to birth trauma; Allergic rhinitis due to food; Cytopathology, evaluation of fine needle aspirate; Treat thigh fracture; Drowning and submersion due to other accident to unspecified watercraft; Individual Psychotherapy for 60 minutes.

*We obtained the same features from the Information Gain and CFS techniques.
3.2.7 Machine Learning for Classification
For the ML analyses, the dataset was divided into two parts: 70% of the data was used
for training, and 30% of the data was used for testing. Our ML problem is a three-
class problem in which we classified each patient according to medication
consumption; thus, a patient can be classified under class A, class B, or both. We used
the 9 features selected from different algorithms (see Table 2) along with the other 6
features (see Table 1) to train our ML models. Therefore, all the ML models were
trained with 15 features in total for classifying the patients into three classes.
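A minimal sketch of this evaluation protocol, assuming scikit-learn and synthetic data in place of the MarketScan features: a 70/30 split followed by the five classifiers compared in the paper, each scored on the held-out 30%.

```python
# Minimal sketch: 70/30 split and the five classifiers compared in the paper,
# trained on 15 placeholder features and a synthetic three-class label.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 15))                   # 9 selected + 6 fixed features
y = rng.choice(["A", "B", "Both"], size=1000, p=[0.55, 0.40, 0.05])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=0)

models = {
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(criterion="entropy", random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM (linear)": SVC(kernel="linear"),
    "SVM (radial)": SVC(kernel="rbf"),
}
for name, model in models.items():
    acc = model.fit(X_tr, y_tr).score(X_te, y_te)
    print(f"{name}: {acc:.2%} test accuracy")
```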
4 Results
We applied various ML algorithms like Naïve Bayes, decision tree, logistic
regression, support vector machine (linear kernel), and support vector machine (radial
kernel) on our dataset and compared their classification accuracy. We used 6 different
features selection approaches in this paper. Fig. 1 shows the classification accuracy on
training data from different ML algorithms for the three-class classification problem.
Fig. 2 shows the classification accuracy on test data from different ML algorithms for
the three-class classification problem. The x-axis in Fig. 1 and Fig. 2 shows the
different features selection techniques and the y-axis shows the accuracy as a
percentage. We found that all the ML algorithms gave the best accuracy on test data
when their features were selected using the LASSO features selection approach. On
test data, the best accuracy of 59.04% was achieved from logistic regression with
features selected using LASSO. The second best accuracy of 58.56% was achieved
from SVM (radial kernel) with features selected using LASSO. The third best
accuracy of 57.5% was achieved from SVM (radial kernel) with features selected
using PCA.
Furthermore, on test data, the best accuracy with features selected using
information gain and CFS was 56.99%, from the SVM (radial kernel) algorithm.
Similarly, on test data, the best accuracy with features selected using Apriori and
ridge regression was 56.97%, from the SVM (radial kernel) algorithm.
Fig. 1. The classification accuracy (%) on training data from different ML algorithms, by features selection technique:

                      Apriori   PCA     LASSO   Ridge regression   Information Gain   CFS
Naïve Bayes           52.28     53.79   54.89   52.28              52.28              52.28
Logistic Regression   57.17     58.06   58.86   57.19              57.22              57.22
Decision Tree         58.22     70.84   64.25   58.22              58.20              58.20
SVM (Linear kernel)   56.18     57.16   57.95   56.14              56.13              56.13
SVM (Radial kernel)   57.18     59.94   58.95   57.18              57.26              57.26

Fig. 2. The classification accuracy (%) on test data from different ML algorithms, by features selection technique:

                      Apriori   PCA     LASSO   Ridge regression   Information Gain   CFS
Naïve Bayes           51.99     52.50   54.64   51.99              51.99              51.99
Logistic Regression   56.90     57.35   59.02   56.89              56.98              56.98
Decision Tree         56.56     53.64   57.48   56.56              56.71              56.71
SVM (Linear kernel)   55.77     56.86   57.78   55.77              55.81              55.81
SVM (Radial kernel)   56.97     57.50   58.56   56.97              56.99              56.99
5 Discussion and Conclusions
Medical datasets contain multiple patient-related features. Most of these features are
the diagnoses or procedures that the patient has undergone throughout their treatment
[15]. Several of these features could be inter-related or interdependent and can
influence the medication that they consume. However, in order to classify patients
according to the medicine they consume, we need to first select the right subset of
these diagnoses and procedures (features). There are various state-of-the-art features
selection techniques available in the literature [5, 6]. All of these techniques follow
different mechanisms to select the relevant features in a dataset. Furthermore,
researchers have also checked the potential of Apriori algorithm [15] to select the
frequently appearing diagnoses and procedures in the medical dataset. In this paper,
our primary objective is to compare the PCA, information gain, correlation
coefficients score, LASSO, and ridge regression with the Apriori algorithm to select
the relevant features before applying machine learning algorithms for classifying the
patients according to the type of medication they consume, i.e. medicine A, medicine
B, or both. There were 15,075 diagnoses and procedures for about 45,000 patients in
the dataset. We selected the top 9 most relevant features from all the features selection
techniques. After combining these 9 (selected) diagnoses and procedures with 6 other
demographic and clinical features (15 in total), we applied naïve Bayes, decision tree,
logistic regression, SVM (linear kernel), and SVM (radial kernel) to classify the
patients.
First, we found that all the ML algorithms had the highest accuracy when we used
the LASSO method for feature selection. This result is likely because LASSO is an L1
regularization and regression technique, which penalizes a model for having
too many variables [21]. The consequence of imposing this penalty is to
shrink the coefficient values towards zero, which allows the less-contributing variables
to have a coefficient close to or equal to zero. Therefore, LASSO selects only
relevant features which have the maximum contribution towards predicting the class
variable. This could be a likely reason why the performance of the classifiers
improved with the features selected from LASSO.
Second, we found that the SVM (radial kernel) gave the highest accuracy (on test
data) when it used features selected by the Apriori, PCA, ridge regression, information
gain, and correlation coefficients score based techniques. Only in the case of LASSO
did logistic regression (59.02% classification accuracy on test data) perform better than
the SVM (radial kernel; 58.56% classification accuracy on test data). However, the
difference in their classification accuracy is only marginal. One possible reason
could be that the SVM (radial kernel) performs well when the nature of the data is non-linear.
Furthermore, prior research has compared the Apriori approach to features
selection with the case when all features are present [15]; however, Apriori was not
compared with other features selection approaches. In this paper, we performed this
comparison, and we found that Apriori performed similarly to the ridge regression,
information gain, and CFS approaches. However, the LASSO approach performed better
than the Apriori approach on this dataset. From our findings, we conclude that it is a good
practice to perform features selection before applying machine learning. Furthermore,
LASSO may be used as a feature selection approach in datasets where we deal with
thousands of inter-related features.
In this paper, we compared the traditional features selection approaches on a
healthcare dataset involving several attributes. However, recent literature on deep-
learning has revealed the effectiveness of using different forms of autoencoders for
feature selection [26]. Thus, as part of our future work, we plan to extend our
investigation by applying different forms of autoencoders on this dataset. These ideas
form the immediate next steps in our machine-learning research program in the
healthcare domain.
Acknowledgment. The project was supported by grants (awards:
#IITM/CONS/PPLP/VD/03 and #IITM/CONS/RxDSI/VD/16) to Varun Dutt.
References
1. Bhardwaj, R., Nambiar, A. R., & Dutta, D.: A Study of Machine Learning in Healthcare.
In Computer Software and Applications Conference (COMPSAC), IEEE 41st Annual,
Vol. 2, pp. 236-241 (2017).
2. Oswal, S., Shah, G., and Student, P. G.: A Study on Data Mining Techniques on
Healthcare Issues and its uses and Application on Health Sector. International Journal of
Engineering Science. 13536 (2017).
3. Sharma, A., & Mansotra, V.: Emerging applications of data mining for healthcare
management-a critical review. In Computing for Sustainable Global Development
(INDIACom), IEEE International Conference, pp. 377-382. (2014).
4. Parikh R. B., Obermeyer Z., and Bates D. W. (2016) Making Predictive Analytics a
Routine Part of patient Care.
https://hbr.org/2016/04/making-predictive-analytics-a-routine-part-of-patient-care
Accessed 5 January 2018.
5. Song, F., Guo, Z. and Mei, D.: Feature selection using principal component analysis.
In System science, engineering design and manufacturing informatization (ICSEM),
international conference on IEEE, Vol. 1, pp. 27-30 (2010).
6. Jović, A., Brkić, K., & Bogunović, N.: A review of feature selection methods with
applications. In Information and Communication Technology, Electronics and
Microelectronics (MIPRO), IEEE 38th International Convention, pp. 1200-1205 (2015).
7. Agrawal, R., and Srikant, R.: Fast algorithms for mining association rules. In Proc. 20th
int. conf. very large data bases, VLDB. Vol. 1215, pp. 487-499 (1994).
8. Sharma, R., Singh, S.N. and Khatri, S.: Medical data mining using different classification
and clustering techniques: a critical survey. In Computational Intelligence &
Communication Technology (CICT), Second International Conference on IEEE. pp. 687-
691 (2016).
9. Abdullah, U., Ahmad, J., & Ahmed, A.: Analysis of effectiveness of apriori algorithm in
medical billing data mining. In Emerging Technologies, ICET, 4th International
Conference on IEEE, pp. 327-331 (2008).
10. Ilayaraja, M. and Meyyappan, T.: Efficient Data Mining Method to Predict the Risk of
Heart Diseases through Frequent Itemsets. Procedia Computer Science, 70, pp.586-592
(2015).
11. Stilou, S., Bamidis, P. D., Maglaveras, N., & Pappas, C.: Mining association rules from
clinical databases: an intelligent diagnostic process in healthcare. Studies in health
technology and informatics, (2), pp. 1399-1403 (2001).
12. Kaushik, S., Choudhury A., Mallik K., Moid A., and Dutt V.: Applying Data Mining to
Healthcare: A Study of Social Network of Physicians and Patient Journeys. In Machine
Learning and Data Mining in Pattern Recognition, pp. 599-613. Springer International
Publishing, New York (2016).
13. Janecek, A., Gansterer, W., Demel, M. and Ecker, G.: On the relationship between feature
selection and classification accuracy. In New Challenges for Feature Selection in Data
Mining and Knowledge Discovery, pp. 90-105 (2008).
14. Motoda, H. and Liu, H.: Feature selection, extraction and construction. Communication of
IICM (Institute of Information and Computing Machinery, Taiwan) Vol, 5, pp.67-72
(2002).
15. Kaushik, S., Choudhury, A., Dasgupta, N., Natarajan, S., Pickett, L. A., & Dutt, V.:
Evaluating Frequent-Set Mining Approaches in Machine-Learning Problems with Several
Attributes: A Case Study in Healthcare. In International Conference on Machine Learning
and Data Mining in Pattern Recognition, pp. 244-258. Springer, Cham (2018).
16. Liu, C., Wang, W., Zhao, Q., Shen, X., & Konan, M.: A new feature selection method
based on a validity index of feature subset. Pattern Recognition Letters, 92, pp. 1-8,
(2017).
17. Jain, D., & Singh, V.: Feature selection and classification systems for chronic disease
prediction: A review. Egyptian Informatics Journal (2018).
18. Harb, H. M., & Desuky, A. S.: Feature selection on classification of medical datasets
based on particle swarm optimization. International Journal of Computer Applications,
104(5) (2014).
19. Lee, I. H., Lushington, G. H., & Visvanathan, M.: A filter-based feature selection
approach for identifying potential biomarkers for lung cancer. Journal of clinical
Bioinformatics, 1(1), 11, (2011).
20. Hall, M. A.: Correlation-based feature selection for machine learning (1999).
21. Fonti, V., & Belitser, E.: Feature selection using lasso. VU Amsterdam Research Paper in
Business Analytics (2017).
22. Quinlan, J.R.: Induction of decision trees. Machine learning, 1(1), pp.81-106 (1986).
23. Langley, P. and Sage, S.: Induction of selective Bayesian classifiers. In Proceedings of the
Tenth international conference on Uncertainty in artificial intelligence. Morgan Kaufmann
Publishers Inc., pp. 399-406 (1994).
24. Peng, C.Y.J., Lee, K.L. and Ingersoll, G.M.: An introduction to logistic regression analysis
and reporting. The journal of educational research, 96(1), pp.3-14 (2002).
25. Hearst, M.A., Dumais, S.T., Osuna, E., Platt, J. and Scholkopf, B.: Support vector
machines. IEEE Intelligent Systems and their applications, 13(4), pp.18-28 (1998).
26. Guo, X., Minai, A. A., & Lu, L. J.: Feature selection using multiple auto-encoders. IEEE
International Joint Conference on Neural Networks (IJCNN) pp. 4602-4609 (2017).
27. Kamkar, I., Gupta, S. K., Phung, D., & Venkatesh, S.: Stable feature selection for clinical
prediction: Exploiting ICD tree structure using Tree-Lasso. Journal of biomedical
informatics, 53, pp.277-290 (2015).
28. Hoque, N., Bhattacharyya, D. K., & Kalita, J. K.: MIFS-ND: A mutual information-based
feature selection method. Expert Systems with Applications, 41(14), pp.6371-6385 (2014).