Evaluating Auto-encoder and Principal Component Analysis for Feature Engineering in Electronic Health Records

Shruti Kaushik (1,a), Abhinav Choudhury (1,b), Nataraj Dasgupta (2,c), Sayee Natarajan (2,d), Larry A. Pickett (2,e), and Varun Dutt (1,f)
1 Applied Cognitive Science Laboratory, Indian Institute of Technology Mandi, Himachal Pradesh, India – 175005
2 RxDataScience, Inc., USA - 27709
a shruti_kaushik@students.iitmandi.ac.in, b abhinav_choudhury@students.iitmandi.ac.in, c nd@rxdatascience.com, d sayee@rxdatascience.com, e larry@rxdatascience.com, and f varun@iitmandi.ac.in
Abstract. Feature engineering is an important mechanism where we transform
and represent the high-dimensional data into a lower-dimensional space. These
representations can then be used to efficiently train machine-learning models.
Auto-encoders are widely used in research for unsupervised feature learning.
However, the application of auto-encoders for electronic health records (EHRs)
containing features with binary values (binary-valued features) has been less
studied. The primary objective of this research was to compare an auto-encoder
with principal component analysis (PCA), a popular feature engineering
technique, for feature selection in different (US and Indian) EHR datasets
containing binary-valued features. The US dataset contained thousands of
binary-valued features, and the Indian dataset contained nineteen binary-valued
features. Results revealed that feature selection by the auto-encoder followed
by different classification algorithms gave the highest accuracy on both the
datasets compared to feature selection by PCA. We highlight the implications of
using auto-encoders for learning features in EHR datasets.
Keywords: Auto-encoders, principal component analysis, dimensionality
reduction, machine learning, classification, EHR, features.
1 Introduction
The use of electronic health records (EHRs) has increased among hospitals, clinics,
and patients' care settings [1]. These records may store patients' visit information,
demographic details, diagnoses, lab test results, and prescription information [2, 16].
In general, EHR datasets contain several features which may help accurately predict
healthcare outcomes.
Prior research has proposed several machine learning (ML) algorithms for
predicting different healthcare outcomes [10]. The accurate prediction of healthcare
outcomes, however, may require selection of relevant features (i.e., feature
engineering) in data. In fact, keeping irrelevant and redundant features could mislead
ML algorithms [5]. Thus, one may need to perform feature engineering before
implementing ML algorithms [5].
Some of the feature engineering techniques include filters, wrappers, and embedded
methods, where these methods help reduce the number of features in data [5].
However, beyond these techniques, there exist other techniques that may create new
features from the original features present in datasets [4]. For example, principal
component analysis (PCA) is a feature engineering technique that performs a linear
combination of original features to create a new set of features in lower dimensional
feature space [15]. Similarly, an auto-encoder, a neural network trained to reproduce its
inputs at its outputs, is another feature engineering technique in which the new features are
non-linear combinations of the original features [3]. Prior research has used different
auto-encoder-based approaches for feature selection on image and magnetic resonance
image datasets [6-8]. Research has also compared auto-encoder-based and PCA-based
feature engineering techniques for classifying neuro-images [6]. The main advantage
of PCA and auto-encoder techniques over filters, wrappers, and embedded
methods is that the former allow all original features to contribute to the transformed
features, whereas the latter may eliminate some of the original features
from datasets.
In the real world, EHR datasets may contain several features with binary
absent/present values (i.e., binary-valued features) [11]. For example, EHRs may
contain diagnosis codes, procedure codes, and other demographic variables as absent
or present features corresponding to patients [11]. In fact, features in EHR datasets
could be converted into binary-valued features, where rows are linked to unique
patients and columns are the different binary (absent/present) features. Although
auto-encoder-based and PCA-based feature engineering techniques have been used in
healthcare-related image analyses, to the best of the authors' knowledge, a
comparison of these techniques on EHRs containing several binary-valued features
has not been reported in the literature. In this research, we address this gap by
evaluating auto-encoder-based and PCA-based feature-engineering approaches on
real-world EHR datasets. Here, we compare the feature-engineering techniques
by evaluating how accurately records can be classified after feature engineering.
The primary objective of this paper is to evaluate PCA and auto-encoders in their
feature engineering capabilities across two EHR datasets containing binary-valued
features. One dataset involves the purchase of two pain medications in the US, and
the other dataset involves the purchase of five general-purpose medications in a large
district hospital in Himachal Pradesh, India. We perform classification of records post
feature engineering by relying upon three standard ML algorithms: the naive
Bayes classifier [12], logistic regression [13], and the support vector machine (SVM)
[14].
In what follows, we first provide a brief review of related literature. Next, we
explain the methodology of applying various feature engineering techniques. In
Section 4, we present our experimental results and compare classification accuracies
post feature engineering using PCA and auto-encoders. Finally, we discuss our results
and conclude our paper by highlighting the main implications of this research and its
future scope.
2 Background
Prior research has evaluated PCA as a feature engineering technique in image
retrieval tasks [4]. For example, reference [4] compared PCA and linear discriminant
analysis (LDA) in a content-based image retrieval task and found PCA to perform
better than LDA. Also, researchers have compared PCA with other feature
engineering techniques like the gain ratio, fuzzy rough features (FRF) selection, and
correlation-based feature selection in a breast cancer dataset [18]. Results show that
the FRF approach outperformed PCA in terms of classification accuracy [18].
Prior research has also evaluated auto-encoders as a feature engineering technique
involving magnetic resonance images [6] and image datasets [7]. For example,
reference [6] used stacked auto-encoders for feature engineering and compared
auto-encoders with LASSO-based methods, PCA, and two-sample t-test approaches for
prediction of Alzheimer’s disease from magnetic resonance images. Results revealed
better performance with auto-encoders compared to PCA [6]. Similarly, reference [7]
used different variants of auto-encoders for feature engineering on popular image
datasets. Moreover, researchers have also used various ML techniques like
decision trees, naive Bayes, SVM, neural networks, and regression approaches for
performing classification post feature engineering [6, 11].
Although a number of image-based applications have utilized PCA-based and auto-
encoder-based feature engineering methods, an evaluation of these feature
engineering approaches has not been done on EHR datasets containing several binary-
valued features. Overall, on EHR datasets, we expect auto-encoders to perform better
compared to PCA in feature engineering and post classification. This expectation is
based upon prior literature presented above as well as the fact that auto-encoders are
nonlinear feature engineering techniques compared to PCA, which is a linear feature
engineering technique [6].
3 Method
3.1 Data
We used two datasets for feature engineering and subsequent classification. The first
dataset (І) was the Truven MarketScan® health dataset¹ containing patients’ insurance
claims in the US [16]. This dataset contained approximately 45,000 unique patients.
Between January 2011 and December 2015, these patients formed the following
consumer groups across two common pain medications: consumers of medicine A
(55.2% of patients), consumers of medicine B (39.98% of patients), and consumers of both
medicines A and B (4.82% of patients).² The dataset contains patients’ demographic
variables (age, gender, region, and birth year), clinical variables (admission type,
diagnoses made, and procedures performed), the names of medicines, and medicines’
refill counts per patient. There were a total of 15,081 features (including the class) present
against each patient in the Truven dataset. Out of these features, 15,075 features were
present/absent binary-valued diagnosis and procedure codes. The other 6
(non-binary-valued) features are listed in Table 1. Five out of these 6 features contained
demographic information and the last feature contained the class label.
¹ The Truven MarketScan dataset links paid claims and detailed patient information over time.
² Due to the non-disclosure agreement, we have anonymized the original names of these medications.
We first separated the 15,075 features (diagnosis and procedure codes) from
the demographic features (listed in Table 1). We then applied PCA and the auto-encoder
to the 15,075 diagnosis and procedure codes to obtain the relevant features and then
combined these features with the other 6 non-binary-valued demographic
features. We reduced the 15,075 binary features to 1000, 5000, and 10000 transformed
features in turn, in order to
investigate whether the classification accuracy changed with the number of
features. These new features, along with the 6 features listed in Table 1, were
then used to classify patients into the three consumer classes discussed above.
Table 1. Description of Input Features for Dataset (І)
Features                  Description
Gender                    Male, Female
Age-group                 0-17, 18-34, 35-44, 45-54, 55-64
Region                    Northeast, northcentral, south, west, unknown
Type of admission         Surgical, medical, maternity and newborn, psych and substance abuse, unknown
Refill count              Count in number
Pain medication (Class)   A, B, Both
The second dataset (ІІ) was collected from a government hospital in Mandi district,
Himachal Pradesh, India. This dataset covered five general-purpose medications,
which were the five medications most prescribed by the doctors in this hospital.
The dataset contained approximately 30,000 unique patients who consumed these five
medications (A’ to E’) between June 2016 and January 2018. Of the records in the dataset,
20% belonged to patients who consumed A’, 16.4% to patients who consumed
B’, 19.2% to patients who consumed C’, 18.5% to patients who
consumed D’, and 25.9% to patients who consumed E’. There were a total of
21 features in this dataset, including the class label (A’ to E’ corresponding to
different medications). Table 2 shows the description of these 21 features. Out of the
21 features, the first 19 features were binary-valued (see Table 2). Overall, both
datasets contained binary-valued features across a majority of their attributes with a
similar number of unique patients.
We performed feature engineering only on the first 19 binary-valued features.
We reduced these 19 binary-valued features to 5, 10, and 15 transformed
features in turn. Then, we combined
the transformed features with the Quantity feature and the Class label (the last two
attributes in Table 2) to classify the patients according to the medication they
consumed (A’ to E’).
Table 2. Description of Input Features for Dataset (ІІ)
Features Description
Age-group (0-18) Contains binary value (0/1)
Age-group (19-39) Contains binary value (0/1)
Age-group (40-59) Contains binary value (0/1)
Age-group (60+) Contains binary value (0/1)
Male OPD Contains binary value (0/1)
Female OPD Contains binary value (0/1)
Medicine OPD Contains binary value (0/1)
Skin OPD Contains binary value (0/1)
Eye OPD Contains binary value (0/1)
ENT OPD Contains binary value (0/1)
Surgical OPD Contains binary value (0/1)
Orthopedic OPD Contains binary value (0/1)
Dental OPD Contains binary value (0/1)
Gyne OPD Contains binary value (0/1)
Psychiatry OPD Contains binary value (0/1)
Skin OPD Contains binary value (0/1)
Pediatrics OPD Contains binary value (0/1)
Emergency OPD Contains binary value (0/1)
Pulmonary Medicine Contains binary value (0/1)
Quantity Total number of capsules
Class Name of medicine
Note: 1 indicates that the patient belongs to a specific feature. OPD means Out Patient Department.
3.2 Principal Component Analysis (PCA)
PCA is a feature engineering technique that does not directly select features as present
in the dataset [15]. Instead, PCA aims to reduce the dimensionality of a dataset
containing several correlated features by transforming the original feature space into a
new feature space in which all the features are uncorrelated [15]. PCA finds the
principal components in data, where these components are the directions along which the
data is most spread out, i.e., the directions with the most variance. The process of
finding the principal components is described below:
1. Take the complete dataset of $n + 1$ dimensions and discard the class
attribute such that our dataset becomes $n$-dimensional.
2. Calculate the mean for each dimension of the dataset.
3. Calculate the covariance matrix of the whole dataset.
4. Calculate the eigenvectors and the corresponding eigenvalues.
5. Sort the eigenvectors by decreasing eigenvalues and choose the $m$ eigenvectors
with the largest eigenvalues to form an $n \times m$ dimensional matrix $W$.
6. Use this $n \times m$ eigenvector matrix to transform the original features onto
the new subspace.
Thus, implementing PCA means finding the eigenvalues and eigenvectors of the
features’ covariance matrix in data [15]. Eigenvectors and eigenvalues exist in pairs.
Each eigenvector gives a direction, and the corresponding eigenvalue (a scalar)
tells how much variance is present in the data in that direction. For dataset (І), we
selected the top 1000, 5000, and 10000 eigenvectors and transformed the
features in the direction of these eigenvectors. After this step, the new features were
combined with the 6 other features (listed in Table 1) for the classification task.
On dataset (ІІ), we selected the top 5, 10, and 15 eigenvectors and transformed
the whole data in the direction of these eigenvectors. The new (transformed) features
were then combined with the quantity and class features (listed in Table 2) to perform the
classification of medications consumed.
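The six PCA steps listed above can be sketched with NumPy as follows. This is only an illustrative sketch, not the implementation used in the study: the function name `pca_transform`, the argument `m` (number of retained components), and the random binary toy matrix standing in for the 19 binary-valued features of dataset (ІІ) are all assumptions made for the example.

```python
import numpy as np

def pca_transform(X, m):
    """Project the rows of X (samples x features) onto the top-m principal components."""
    # Step 2: mean of each feature (column), used to center the data.
    mean = X.mean(axis=0)
    X_centered = X - mean
    # Step 3: covariance matrix of the features (features x features).
    cov = np.cov(X_centered, rowvar=False)
    # Step 4: eigen-decomposition; eigh is appropriate because cov is symmetric.
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # Step 5: sort by decreasing eigenvalue and keep the top-m eigenvectors as W.
    order = np.argsort(eigenvalues)[::-1][:m]
    W = eigenvectors[:, order]
    # Step 6: transform the centered data onto the new subspace.
    return X_centered @ W, eigenvalues[order]

# Hypothetical toy data: 200 records with 19 random binary features (not real EHR data).
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 2, size=(200, 19)).astype(float)
Z, top_eigenvalues = pca_transform(X_toy, m=5)
print(Z.shape)  # (200, 5)
```

The transformed columns of `Z` are mutually uncorrelated, and the variance of each column equals the corresponding eigenvalue, which is what the text means by eigenvalues measuring the variance along each direction.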
3.3 Auto-encoder
An auto-encoder is an unsupervised machine learning technique that can learn
representations from data [3]. Auto-encoders work by compressing the input into a
latent-space representation (lower dimensional space) and then reconstructing the
input from this representation.
Fig. 1. The architecture of an auto-encoder [3]
In an auto-encoder, we take an unlabeled dataset and frame it as a supervised learning
problem tasked with outputting $\hat{x}$, a reconstruction of the original input $x$ (Fig. 1).
The auto-encoder consists of two parts: an encoder and a decoder. Given the unlabeled
input dataset $X$, the encoder maps an input $x \in \mathbb{R}^{n}$ to $z \in \mathbb{R}^{m}$,
where $n$ is the total number of features in the data and $m$ is the reduced number of
features in the latent space. The encoding process is defined as follows:
$$z = f(W_1 x + b_1) \qquad (1)$$
where $f$ is the encoding function, $W_1$ is the weight matrix of the encoder, $b_1$ is the
bias vector, and $z$ is known as the latent representation. Once the input has been
encoded, the decoder tries to reconstruct the input from the latent representation $z$
and maps it to the output $\hat{x} \in \mathbb{R}^{n}$. The decoding process is defined as follows:
$$\hat{x} = g(W_2 z + b_2) \qquad (2)$$
where $g$ is the decoding function, $W_2$ is the weight matrix of the decoder, and $b_2$ is
the bias vector. This network is then trained by minimizing the reconstruction
error $L(x, \hat{x})$, which measures the difference between the original input and its
reconstruction:
$$L(x, \hat{x}) = \lVert x - \hat{x} \rVert^{2} \qquad (3)$$
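To make equations (1) to (3) concrete, the following NumPy sketch runs one forward pass of a single-layer auto-encoder. It rests on assumptions not stated in the text: a sigmoid is used for both the encoding and decoding functions, the weights are random and untrained, and the sizes (19 inputs, 5 latent units) are borrowed from dataset (ІІ) purely for illustration.

```python
import numpy as np

def sigmoid(a):
    # Assumed activation; the text does not name the encoding/decoding functions.
    return 1.0 / (1.0 + np.exp(-a))

def encode(x, W1, b1):
    # Equation (1): latent representation z computed from the input x.
    return sigmoid(W1 @ x + b1)

def decode(z, W2, b2):
    # Equation (2): reconstruction of the input from z.
    return sigmoid(W2 @ z + b2)

def reconstruction_error(x, x_hat):
    # Equation (3): squared Euclidean distance between input and reconstruction.
    return float(np.sum((x - x_hat) ** 2))

rng = np.random.default_rng(0)
n, m = 19, 5  # input and latent sizes (illustrative)
W1, b1 = rng.normal(scale=0.1, size=(m, n)), np.zeros(m)  # untrained encoder weights
W2, b2 = rng.normal(scale=0.1, size=(n, m)), np.zeros(n)  # untrained decoder weights

x = rng.integers(0, 2, size=n).astype(float)  # one hypothetical binary record
z = encode(x, W1, b1)
x_hat = decode(z, W2, b2)
loss = reconstruction_error(x, x_hat)
print(z.shape, x_hat.shape)  # (5,) (19,)
```

Training would adjust `W1`, `b1`, `W2`, and `b2` to reduce `loss` over many records; after training, only `encode` is needed to produce the new features.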
The stacked auto-encoder consists of multiple layers of nodes in which the outputs of
each layer are wired to the inputs of the successive layer (Fig. 2).
Fig. 2. The architecture of a stacked auto-encoder [3]
Given dataset (І) with approximately 45,000 records, the encoder maps an input
$x \in \mathbb{R}^{15075}$ to $z \in \mathbb{R}^{m}$, where $m$ = 1000, 5000, and 10000.
In the case of dataset (ІІ) with approximately 30,000 records, the
encoder maps an input $x \in \mathbb{R}^{19}$ to $z \in \mathbb{R}^{m}$, where $m$ = 5, 10, and 15. Once the
auto-encoder has been trained, we save the latent representation and combine it with
the other features to perform classification. For training the auto-encoder on both
datasets, we tried different batch sizes and finally used a batch size of 64.
Furthermore, 90% of the data was used for training, and the remaining 10% was
used for testing. The auto-encoder was trained for 50 epochs on both datasets to
obtain the encoded dimensions (new features). We used Adadelta as the optimizer and
mean squared error as the loss function [19]. Table 3 shows the architecture of
the stacked auto-encoders used to obtain the different sets of features. These
architectures were selected after a trial-and-error evaluation of the test loss from
different auto-encoder architectures (the objective is to minimize the test loss).
On dataset (І), the 1000 encoded dimensions were achieved with 15075 neurons in
the input layer, 8000, 4000, 2000, and 1000 neurons in the encoder layers, and 2000,
4000, 8000, and 15075 neurons in the decoder layers. The 5000 encoded dimensions
were achieved with 15075 neurons in the input layer, 11000, 7000, and 5000 neurons
in the encoder layers, and 7000, 11000, and 15075 neurons in the decoder layers.
Similarly, the 10000 encoded dimensions were achieved with 15075 neurons in the
input layer, 13000 and 10000 neurons in the encoder layers, and 13000 and 15075 neurons in
the decoder layers.
On dataset (ІІ), the 5 encoded dimensions were achieved with 19 neurons in the
input layer, 16, 12, 8, and 5 neurons in the encoder layers, and 8, 12, 16, and 19
neurons in the decoder layers. The 10 encoded dimensions were achieved with 19
neurons in the input layer, 16, 13, and 10 neurons in the encoder layers, and 13, 16,
and 19 neurons in the decoder layers. Similarly, the 15 encoded dimensions were
achieved with 19 neurons in the input layer, 17, 15 neurons in the encoder layers, and
17, 19 neurons in the decoder layers.
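The architectures described above are symmetric: each decoder mirrors its encoder and ends at the input size. The short sketch below, in which the helper name `decoder_layers` and the dictionary layout are illustrative assumptions, reproduces the decoder layer sizes and the layer counts from the encoder sizes given in the text.

```python
def decoder_layers(input_dim, encoder_layers):
    """Mirror the encoder layer sizes to obtain the decoder layer sizes."""
    # Drop the bottleneck (last encoder layer), reverse the rest, end at the input size.
    return list(reversed(encoder_layers[:-1])) + [input_dim]

# (dataset, encoding dimension) -> (input size, encoder layer sizes), as stated in the text.
architectures = {
    ("I", 1000): (15075, [8000, 4000, 2000, 1000]),
    ("I", 5000): (15075, [11000, 7000, 5000]),
    ("I", 10000): (15075, [13000, 10000]),
    ("II", 5): (19, [16, 12, 8, 5]),
    ("II", 10): (19, [16, 13, 10]),
    ("II", 15): (19, [17, 15]),
}

layer_counts = {}
for key, (input_dim, encoder) in architectures.items():
    decoder = decoder_layers(input_dim, encoder)
    layer_counts[key] = len(encoder) + len(decoder)
    print(key, "encoder:", encoder, "decoder:", decoder, "layers:", layer_counts[key])
```

The resulting totals of 8, 6, and 4 layers for each dataset agree with the layer counts reported in Table 3.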
Table 3. Description of Stacked Auto-encoder
Dataset   Total features (binary-valued)   Encoding dimension   Total number of encoder and decoder layers   Test loss
І         15075                            1000                 8 (4 encoder, 4 decoder)                     0.010
І         15075                            5000                 6 (3 encoder, 3 decoder)                     0.002
І         15075                            10000                4 (2 encoder, 2 decoder)                     0.009
ІІ        19                               5                    8 (4 encoder, 4 decoder)                     0.030
ІІ        19                               10                   6 (3 encoder, 3 decoder)                     0.018
ІІ        19                               15                   4 (2 encoder, 2 decoder)                     0.008
3.4 Classification Algorithms
We combined the new transformed features from PCA and auto-encoders with other
features in dataset I and II (see Tables 1 and 2). Both datasets were then divided into
two parts for classification using naive Bayes, logistic regression and SVM: 70% of
the data was used for training, and 30% of the data was used for testing.
3.4.1 Naive Bayes
Naive Bayes is a probabilistic classifier that is based on Bayes' theorem. It is
called naive because it makes a strong independence assumption between features
[12]: it assumes that the value of a specific feature is independent of the value of any
other feature, given the target (class) label. Despite this assumption, naive Bayes has
been quite successful in solving practical problems in text classification, medical
diagnosis, and system performance management [12]. The classifier attempts to
maximize the posterior probability in determining the class of a record.
Suppose the vector $y = (y_1, y_2, \ldots, y_n)$ represents the features in the problem, with $n$
denoting the total number of features, and let $C_k$, $k = 1, \ldots, K$, denote the $K$ possible classes.
Naive Bayes is a conditional probability model which can be decomposed as [12]:
$$p(C_k \mid y) = \frac{p(C_k)\, p(y \mid C_k)}{p(y)} \qquad (4)$$
Under the independence assumption, the probabilities of the features are defined as
follows [20]:
$$p(C_k \mid y_1, \ldots, y_n) \propto p(C_k) \prod_{i=1}^{n} p(y_i \mid C_k) \qquad (5)$$
The most likely class $C_{\hat{k}}$ is then picked based on the maximum a posteriori (MAP)
decision rule [20] as follows:
$$\hat{k} = \underset{k \in \{1, \ldots, K\}}{\arg\max}\; p(C_k) \prod_{i=1}^{n} p(y_i \mid C_k) \qquad (6)$$
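As an illustration of the MAP rule in equation (6) for binary-valued features, the sketch below fits a Bernoulli naive Bayes model on a tiny hand-made dataset and classifies one record. The Bernoulli likelihood, the Laplace smoothing constant `alpha`, the function names, and the toy data are assumptions for the example and are not taken from the paper. The product in equation (6) is evaluated as a sum of logarithms to avoid numerical underflow.

```python
import numpy as np

def fit_bernoulli_nb(X, labels, num_classes, alpha=1.0):
    """Estimate log priors log p(C_k) and feature probabilities p(y_i = 1 | C_k)."""
    log_prior = np.zeros(num_classes)
    theta = np.zeros((num_classes, X.shape[1]))
    for k in range(num_classes):
        X_k = X[labels == k]
        log_prior[k] = np.log(len(X_k) / len(X))
        # Laplace smoothing (alpha) keeps the estimates away from exactly 0 or 1.
        theta[k] = (X_k.sum(axis=0) + alpha) / (len(X_k) + 2 * alpha)
    return log_prior, theta

def predict_map(y, log_prior, theta):
    """Return the k maximizing log p(C_k) + sum_i log p(y_i | C_k)."""
    log_likelihood = y * np.log(theta) + (1 - y) * np.log(1 - theta)
    return int(np.argmax(log_prior + log_likelihood.sum(axis=1)))

# Hypothetical toy data: 6 records, 3 binary features, 2 classes.
X_toy = np.array(
    [[1, 0, 1], [1, 1, 1], [0, 0, 1], [0, 1, 0], [0, 0, 0], [1, 0, 0]], dtype=float
)
labels_toy = np.array([0, 0, 0, 1, 1, 1])

log_prior, theta = fit_bernoulli_nb(X_toy, labels_toy, num_classes=2)
pred = predict_map(np.array([1.0, 0.0, 1.0]), log_prior, theta)
print(pred)  # 0
```

For the query record (1, 0, 1) the smoothed likelihoods are 0.6 × 0.6 × 0.8 = 0.288 under class 0 and 0.4 × 0.6 × 0.2 = 0.048 under class 1, with equal priors, so class 0 is selected.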
3.4.2 Logistic Regression
Logistic regression is a linear classifier which can be used for modeling the
relationship between one dependent binary variable ($Y$) and one or more independent
variables ($X$) [13]. It models the posterior probabilities of the 2 classes in an instance.
Let $p$ represent the probability of occurrence of a class event ($p = P(Y = y)$,
where $Y$ is the class label possessing a value $y$), which depends on the independent
variables ($x_1, x_2, \ldots, x_n$). We use the following equation for modeling the probability:
$$p = \frac{e^{a_0 + a_1 x_1 + a_2 x_2 + \cdots + a_n x_n}}{1 + e^{a_0 + a_1 x_1 + a_2 x_2 + \cdots + a_n x_n}} \qquad (7)$$
where $a_0$ is the bias or intercept term, and $a_1, a_2, \ldots, a_n$ are the coefficients for the
independent variables ($x_1, x_2, \ldots, x_n$). Since there are 3 classes in dataset (І) and 5
classes in dataset (ІІ), we performed one-vs-all classification using logistic regression.
One-vs-all classification was implemented by training multiple logistic regression
classifiers, one for each of the $K$ classes in the training dataset. Hence, we trained 3
different logistic regression classifiers for dataset (І) and 5 different logistic
regression classifiers for dataset (ІІ). Once the one-vs-all predictions had been made
for all classes, the classifier picked the class with the highest probability.
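The one-vs-all decision step can be sketched as follows: each class has its own logistic model of the form in equation (7), and the class whose model reports the highest probability is chosen. The coefficient values below are hypothetical and not fitted to any data, and the function names are illustrative; fitting the coefficients is outside the scope of this sketch.

```python
import numpy as np

def class_probability(x, intercept, coefs):
    """Equation (7): p = e^s / (1 + e^s), with s = intercept + coefs . x."""
    s = intercept + np.dot(coefs, x)
    return float(np.exp(s) / (1.0 + np.exp(s)))

def one_vs_all_predict(x, intercepts, coefficients):
    """Score x with one binary model per class and pick the most probable class."""
    probs = [class_probability(x, a0, a) for a0, a in zip(intercepts, coefficients)]
    return int(np.argmax(probs)), probs

# Hypothetical (unfitted) models for 3 classes and 2 features.
intercepts = [0.0, -1.0, 0.5]
coefficients = [np.array([1.0, -1.0]), np.array([0.5, 0.5]), np.array([-1.0, 2.0])]

label, probs = one_vs_all_predict(np.array([1.0, 0.0]), intercepts, coefficients)
print(label)  # 0
```

Here the linear scores are 1.0, -0.5, and -0.5, so the first model gives the highest probability (about 0.73) and class 0 is picked.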
3.4.3 Support Vector Machines
Support vector machines (SVMs) are supervised classification techniques which can
handle large feature spaces [14]. SVMs are binary classifiers which can also be utilized
for multi-class classification tasks. They build a hyperplane or a set of
hyperplanes in a high-dimensional space which can be used for classification and
regression tasks. SVMs can classify linearly separable as well as non-linearly
separable data [14]. If the data is linearly separable, then the SVM uses a linear
hyperplane to perform classification. However, for non-linear data, rather than
fitting a non-linear curve, it transforms the data into a high-dimensional space to
perform classification. The SVM uses kernel functions, e.g., the radial basis function
(RBF) kernel, to transform data into the high-dimensional space for classifying
non-linear data [14]. An SVM with the RBF kernel uses the gamma and C parameters to perform classification.
Gamma defines how far the influence of a single training example reaches, with low
values meaning 'far' and high values meaning 'close'. The gamma parameter can be
seen as the inverse of the radius of influence of samples selected by the model as
support vectors. The C parameter defines the cost of misclassification. A large C
gives low bias and high variance, whereas a small C gives higher bias and lower
variance. In this paper, we used SVM with the RBF kernel to classify the patients.
We chose gamma as 1/(number of features) and C = 1
in the SVM across both auto-encoder and PCA. The SVM also performed one-vs-all
classification for classifying the three classes on dataset (І) and five classes on dataset
(ІІ) (this process was similar to the one followed in logistic regression).
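To illustrate how gamma controls the reach of a training example, the sketch below computes the RBF kernel matrix with gamma set to 1/(number of features), as in the text. It shows only the kernel, not SVM training, and the function name `rbf_kernel` and the toy records are assumptions made for the example.

```python
import numpy as np

def rbf_kernel(X, Y, gamma):
    """K[i, j] = exp(-gamma * ||X[i] - Y[j]||^2)."""
    sq_dists = (
        np.sum(X ** 2, axis=1)[:, None]
        + np.sum(Y ** 2, axis=1)[None, :]
        - 2.0 * X @ Y.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

# Hypothetical binary records with 4 features.
X_toy = np.array([[1, 0, 1, 0], [1, 0, 1, 0], [0, 1, 0, 1]], dtype=float)
gamma = 1.0 / X_toy.shape[1]  # gamma = 1 / number of features
K = rbf_kernel(X_toy, X_toy, gamma)
print(np.round(K, 4))
```

Identical records have a kernel value of 1, while the two records that differ in all four features have a squared distance of 4 and a kernel value of exp(-1), about 0.368. A larger gamma would shrink this value further, which corresponds to each example having a 'closer' reach.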
3.4.4 Random Chance Classification
We also ran Monte Carlo [17] simulations 5000 times to generate a random guess
for each of the classes on both datasets. Since the classes are not equally likely,
particularly in dataset (І), we kept the same probability for each class as present
in the actual datasets while running the Monte Carlo
simulations. This was done to check whether the classification algorithms gave better accuracy
than a random guess and by how much the feature engineering techniques
helped in improving the accuracy.
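A random-chance baseline of this kind can be sketched with the Python standard library. Because the actual records are not available here, the sketch assumes that both the true labels and the guesses are drawn independently from the class proportions reported in Section 3.1; the function names, record count, and run count are illustrative. Under that assumption the expected accuracy is simply the sum of squared class probabilities.

```python
import random

def random_chance_accuracy(class_probs, num_records, num_runs, seed=0):
    """Average accuracy of guessing classes with the same proportions as the data."""
    rng = random.Random(seed)
    classes = list(range(len(class_probs)))
    total = 0.0
    for _ in range(num_runs):
        # Assumption: synthetic "actual" labels drawn from the reported class proportions.
        actual = rng.choices(classes, weights=class_probs, k=num_records)
        guesses = rng.choices(classes, weights=class_probs, k=num_records)
        total += sum(a == g for a, g in zip(actual, guesses)) / num_records
    return total / num_runs

def expected_accuracy(class_probs):
    """Closed-form expectation: probability that an independent guess matches the label."""
    return sum(p * p for p in class_probs)

dataset_1 = [0.552, 0.3998, 0.0482]               # medicines A, B, both
dataset_2 = [0.200, 0.164, 0.192, 0.185, 0.259]   # medicines A' to E'

simulated_1 = random_chance_accuracy(dataset_1, num_records=1000, num_runs=200)
print(round(expected_accuracy(dataset_1), 3), round(expected_accuracy(dataset_2), 3))
```

The closed-form values are about 0.467 for dataset (І) and 0.205 for dataset (ІІ), which is in line with the random-chance accuracies of 46.5% and 20% reported in the Results section.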
4 Results
Fig. 3 and Fig. 4 show the accuracy from different classifiers on test data with all
features and with features from feature engineering using auto-encoder (AE) and PCA
(LR, SVM, NB, AVG, and RC refer to logistic regression, support vector
machine, naive Bayes, average accuracy across all classifiers, and random chance,
respectively). On dataset (І), the best accuracy (= 63.01%) was obtained with 10005
features on test data when these features were selected by the stacked auto-encoder
and classified by logistic regression. On dataset (ІІ), the best accuracy (= 63.08%)
was obtained with 16 features on test data when these features were selected by the
stacked auto-encoder and classified by SVM. On dataset (І), decreasing features from
15080 to 1005 (a 93% decrease) decreased the average accuracy from 55.11% to
52.49% in the worst case (a meagre 2.62 percentage-point decrease). On dataset (ІІ), decreasing
features from 20 to 6 (a 70% decrease) decreased the average accuracy from 56.42%
to 47.91% in the worst case (a meagre 8.51 percentage-point decrease). Among all classifiers, the
naive Bayes algorithm was most affected by the decrease in the number of features in
both datasets. Furthermore, the average accuracy of 5000 runs from the random
chance algorithm (Monte Carlo) came out to be 46.5% and 20% on datasets (І) and
(ІІ), respectively. Thus, both the LR and SVM algorithms performed better than
the random chance algorithm (Monte Carlo). Overall, we witnessed only a small
reduction in the average accuracy due to feature reduction.
Fig. 3. Test classification accuracy on dataset (І). [Bar chart: accuracy (%) for LR, SVM, NB, AVG, and RC, grouped by AE, PCA, and all features; series for 1005, 5005, 10005, and 15080 features.]
Fig. 4. Test classification accuracy on dataset (ІІ). [Bar chart: accuracy (%) for LR, SVM, NB, AVG, and RC, grouped by AE, PCA, and all features; series for 6, 11, 16, and 20 features.]
5 Discussion and Conclusions
Feature engineering may be needed in EHR datasets with binary-valued features when
the number of such features is large, as feature engineering is likely to help reduce the
complexity of machine-learning algorithms [5]. The primary objective of this research
was to evaluate two popular feature engineering techniques (PCA and auto-encoders)
for classifying patients according to their medicine consumption across two EHR
datasets involving several binary-valued features.
First, feature engineering using auto-encoders gave better accuracies
compared to feature engineering using PCA. A likely reason for this finding is that
neural networks are capable of learning non-linear relationships from data, whereas
PCA is a linear feature engineering technique [6]. Perhaps the ability to learn
non-linear relationships led to better feature engineering from auto-encoders
compared to PCA.
Second, we found that the naive Bayes classifier's accuracy was most
affected by feature engineering. A likely reason for this finding is that the naive Bayes
algorithm treats all the features independently and gives them equal importance. Thus,
decreasing the number of features dents the accuracy of this algorithm compared to
other algorithms that may not treat all features with equal weights.
Third, there was only a meagre decrease in the average accuracy across
classifiers after feature engineering on binary-valued attributes. Overall, this result is
promising, and it shows that feature engineering on large EHR datasets with
binary-valued features is an ecologically valid exercise. Furthermore, there were significant
improvements across the LR and SVM algorithms compared to the Monte Carlo
simulations in both datasets. Again, these results show that classification using the LR
and SVM approaches seems to be effective across both linear and non-linear
feature-engineering methods.
Finally, there are some other feature-engineering approaches, such as
XGBoost [20] and Bayesian belief networks [21]. Thus, as part of our future work, we
plan to extend our current investigation to these approaches on binary-valued EHR
datasets.
Acknowledgement. This project was supported by grants (awards:
#IITM/CONS/RxDSI/VD/16 and #IITM/CONS/PPLP/VD/03) to Varun Dutt.
References
1. J. E and Y. N.: Electronic Health Record Adoption and Use among Office-based
Physicians in the U.S., by State: 2015 National Electronic Health Records Survey.
The Office of the National Coordinator for Health Information Technology, Tech.
Rep. (2016).
2. P. B. Jensen, L. J. Jensen, and S. Brunak: Translational genetics: Mining electronic
health records: towards better research applications and clinical care. Nature Reviews
- Genetics, Vol. 13, pp. 395–405 (2012).
3. I. Goodfellow, Y. Bengio, and A. Courville: Deep learning. MIT Press, (2016).
4. Shereena, V. B., & David, J. M.: COMPARATIVE STUDY OF
DIMENSIONALITY REDUCTION TECHNIQUES USING PCA AND LDA FOR
CONTENT BASED IMAGE RETRIEVAL. Computer Science & Information
Technology, pp. 41 (2015).
5. Motoda, H. and Liu, H.: Feature selection, extraction and construction.
Communication of IICM (Institute of Information and Computing Machinery,
Taiwan) Vol, 5, pp.67-72 (2002).
6. Shi, S., & Nathoo, F.: Feature Learning and Classification in Neuroimaging:
Predicting Cognitive Impairment from Magnetic Resonance Imaging. In 2018 4th
International Conference on Big Data and Information Analytics (BigDIA), IEEE, pp.
1-5 (2018).
7. Meng, Q., Catchpoole, D., Skillicom, D., & Kennedy, P. J.: Relational autoencoder
for feature extraction. In 2017 International Joint Conference on Neural Networks
(IJCNN), IEEE, pp. 364-371 (2017).
8. Shickel, B., Tighe, P. J., Bihorac, A., & Rashidi, P.: Deep EHR: a survey of recent
advances in deep learning techniques for electronic health record (EHR) analysis.
IEEE journal of biomedical and health informatics, 22(5), pp. 1589-1604 (2018).
9. Kaushik, S., Choudhury A., Mallik K., Moid A., and Dutt V.: Applying Data Mining
to Healthcare: A Study of Social Network of Physicians and Patient Journeys. In
Machine Learning and Data Mining in Pattern Recognition. Springer International
Publishing, New York, pp. 599-613 (2016).
10. Sharma, R., Singh, S.N. and Khatri, S.: Medical data mining using different
classification and clustering techniques: a critical survey. In Computational
Intelligence & Communication Technology (CICT), Second International Conference
on IEEE, pp. 687-691 (2016).
11. Kaushik, S., Choudhury, A., Dasgupta, N., Natarajan, S., Pickett, L. A., & Dutt, V.:
Evaluating Frequent-Set Mining Approaches in Machine-Learning Problems with
Several Attributes: A Case Study in Healthcare. In International Conference on
Machine Learning and Data Mining in Pattern Recognition, Springer, pp. 244-258
(2018).
12. Langley, P. and Sage, S.: Induction of selective Bayesian classifiers. In Proceedings
of the Tenth international conference on Uncertainty in artificial intelligence. Morgan
Kaufmann Publishers Inc., pp. 399-406 (1994).
13. Peng, C.Y.J., Lee, K.L. and Ingersoll, G.M.: An introduction to logistic regression
analysis and reporting. The journal of educational research, 96(1), pp. 3-14 (2002).
14. Hearst, M.A., Dumais, S.T., Osuna, E., Platt, J. and Scholkopf, B.: Support vector
machines. IEEE Intelligent Systems and their applications, 13(4), pp. 18-28 (1998).
15. Song, F., Guo, Z. and Mei, D.: Feature selection using principal component analysis.
In System science, engineering design and manufacturing informatization (ICSEM),
international conference on IEEE, Vol. 1, pp. 27-30 (2010).
16. Danielson, E.: Health research data for the real world: the MarketScan® Databases.
Ann Arbor, MI: Truven Health Analytics (2014).
17. Doubilet, P., Begg, C. B., Weinstein, M. C., Braun, P., & McNeil, B. J.: Probabilistic
sensitivity analysis using Monte Carlo simulation: a practical approach. Medical
decision making, 5(2), pp. 157-177 (1985).
18. El-Hasnony, I. M., El Bakry, H. M., & Saleh, A. A.: Comparative study among data
reduction techniques over classification accuracy. International Journal of Computer
Applications, 122(2) (2015).
19. Zhang, N., Lei, D., & Zhao, J. F.: An Improved Adagrad Gradient Descent
Optimization Algorithm. In 2018 Chinese Automation Congress (CAC), IEEE, pp.
2359-2362 (2018).
20. Zheng, H., Yuan, J., & Chen, L.: Short-term load forecasting using EMD-LSTM
neural networks with a Xgboost algorithm for feature importance evaluation.
Energies, 10(8), 1168 (2017).
21. Cheng, J., & Greiner, R.: Learning bayesian belief network classifiers: Algorithms
and system. In Conference of the Canadian Society for Computational Studies of
Intelligence. Springer, Berlin, Heidelberg, pp. 141-151 (2001).