Research Article Open Access
International Journal of
Biomedical Data Mining
ISSN: 2090-4924
A Discriminative Feature Space for Detecting and Recognizing Pathologies
of the Vertebral Column
Damian Mingle*
WPC Healthcare, 1802 Williamson Court I, Brentwood, USA
*Corresponding author: Damian Mingle, Chief Data Scientist, WPC
Healthcare, 1802 Williamson Court I, Brentwood, USA, Tel: 615-364-9660;
E-mail: dmingle@wpchealthcare.com
Received June 30, 2015; Accepted August 19, 2015; Published September 15,
2015
Citation: Mingle D (2015) A Discriminative Feature Space for Detecting and
Recognizing Pathologies of the Vertebral Column. Biomedical Data Mining 4: 114.
doi:10.4172/2090-4924.1000114
Copyright: © 2015 Mingle D. This is an open-access article distributed under the
terms of the Creative Commons Attribution License, which permits unrestricted
use, distribution, and reproduction in any medium, provided the original author and
source are credited.
Abstract
Each year it becomes more difficult for healthcare providers to determine whether a patient has a pathology related to the vertebral column. There is great potential to become more efficient and effective, in terms of the quality of care provided to patients, through the use of automated systems. However, in many cases automated systems can allow for misclassification and force providers to review more cases than necessary. In this study, we analyzed methods to increase the true positives and lower the false positives while comparing them against state-of-the-art techniques in the biomedical community. We found that by applying the studied techniques of a data-driven model, the benefits to healthcare providers are significant and align with the methodologies and techniques utilized in the current research community.
Keywords: Vertebral column; Feature engineering; Probabilistic
modeling; Pattern recognition
Introduction
Over the years there has been an increase in machine learning (ML) techniques, such as Random Forest (RF), Boosting (ADA), Logistic Regression (GLM), Decision Trees (RPART), Support Vector Machines (SVM), and Artificial Neural Networks (ANN), applied to many medical fields. A significant reason for this is the limited capacity of human beings to act as diagnostic tools over time: stress, fatigue, inefficiencies, and lack of knowledge all become barriers to high-quality outcomes. There have been studies regarding applications of data mining in different fields, namely biochemistry, genetics, oncology, neurology, and EEG analysis. However, the literature suggests that there are few comparisons of machine learning algorithms and techniques in medical and biological areas. Of these ML algorithms, the most common approach to developing nonparametric and nonlinear classifications is based on ANNs.
In general, the numerous methods of machine learning that have been applied can be grouped into two sets: knowledge-driven models and data-driven models. The parameters of knowledge-driven models are estimated based on expert knowledge of detecting and recognizing pathologies of the vertebral column. On the other hand, the parameters of data-driven models are estimated based on quantitative measures of association between evidential features within the data. The classification models previously used for pathologies of the vertebral column have largely been SVMs.
Studies have shown that ML algorithms are more accurate than statistical techniques, especially when the feature space is more complex or the input datasets are expected to have different statistical distributions [1]. These algorithms have the potential to identify and model the complex non-linear relationships between the features of the biomedical data set collected by Dr. da Mota, namely: pelvic incidence (PI), pelvic tilt (PT), lumbar lordosis angle (LLA), sacral slope (SS), pelvic radius (PR), and grade of spondylolisthesis (GOS).
These methods can handle a large number of evidential features that may be important in detecting abnormalities of the vertebral column. However, increasing the number of input evidential features may lead to increased complexity and larger numbers of model parameters, and in turn the model becomes susceptible to overfitting due to the curse of dimensionality.
This work aims to present medical decision support for healthcare providers who are working to diagnose pathologies of the vertebral column. The framework is comprised of three subsystems: feature engineering, feature selection, and model selection.
Pathologies of the vertebral column
Vertebrae, intervertebral discs, nerves, muscles, the medulla, and joints make up the vertebral column. The essential functions of the vertebral column are as follows: (i) support of the human body; (ii) protection of the nervous roots and spinal medulla; and (iii) making the body's movement possible [2].
The structure of the intervertebral disc can be injured by a single small trauma or by several small traumas to the column. Various pathologies, such as disc hernias and spondylolisthesis, can cause intense pain. Backaches can be the result of complications arising within this complex system. We briefly characterize the biomechanical attributes that represent each patient in the data set.
Patient characteristics: Dr. Henrique da Mota collected data on 310 patients from sagittal panoramic radiographies of the spine while at the Centre Medico-Chirurgical de Readaptation des Massues, located in Lyon, France [3]. 100 patients were volunteers who had no pathology in their spines (labeled as 'Normal'). The remaining patients had disc hernia (60 patients) or spondylolisthesis (150 patients).
Decision support for orthopedists is automated using ML algorithms applied to real clinical cases described by the above biomechanical attributes. In what follows, we compare the ML models evaluated in this study.
Problem statement and standard solutions
Classification refers to the problem of categorizing observations
into classes. Predictive modeling uses samples of data for which the class is known to generate a model for classifying new observations. We are only interested in two possible outcomes: 'Normal' and 'Abnormal'. Complex datasets make it difficult not to misclassify some observations. However, our goal was to minimize those errors using the receiver operating characteristic (ROC) curve.
The literature suggests using an ordinal data approach for detecting reject regions in combination with SVM, and selecting the misclassification costs as follows: assign a low cost C_low when classifying an observation as reject, and a high cost C_high when misclassifying it. Therefore, w_r = C_low / C_high is the cost of rejecting, normalized by the cost of erring. The method accounts for both the rejection rate and the misclassification rate [2].
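The article does not spell out how the rejection rule is applied at prediction time. The sketch below illustrates the standard cost-based rule (reject whenever the top posterior falls below 1 - w_r) applied to a probabilistic classifier's output; the function name and labels are ours, and this is an illustration rather than the rejoSVM formulation of [2].

```python
import numpy as np

def classify_with_reject(posteriors, w_r=0.04, labels=("Abnormal", "Normal")):
    """Assign a label, or defer ('Reject'), using the cost-based rejection rule.

    posteriors : (n_samples, n_classes) array whose rows sum to 1.
    w_r        : rejection cost normalized by misclassification cost (C_low / C_high).
    Rejecting is cheaper than a likely error whenever the best posterior
    falls below 1 - w_r, so those cases are handed back to a human reviewer.
    """
    posteriors = np.asarray(posteriors)
    best = posteriors.argmax(axis=1)
    confident = posteriors.max(axis=1) >= 1.0 - w_r
    return [labels[b] if ok else "Reject" for b, ok in zip(best, confident)]

# Two confident predictions and one deferred case (threshold = 0.96 when w_r = 0.04).
print(classify_with_reject([[0.99, 0.01], [0.02, 0.98], [0.60, 0.40]]))
```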
Description of the data
It is useful to understand the basic features of the data in our study. Simple summaries about the sample and the measures, together with graphical analysis, form a solid basis for our quantitative analysis of the vertebral column dataset. We conducted a univariate analysis, which identifies the distribution, central tendency, and dispersion of the data.
The distribution table includes the 1st and 3rd quartiles, indicating the values below which 25% and 75% of the observations fall (Table 1).
Distributions: The distribution of the biomechanical features by class is shown in Figure 1.
Correlation: A correlation analysis provides insight into the independence of the numeric input variables. Modeling often assumes independence, and better models will result when using independent input variables. Table 2 reports the correlations between each pair of variables.
We made use of a hierarchical dendrogram to provide visual clues to the degree of closeness between variables [4]. The hierarchical correlation dendrogram produced here presents a view of the variables of the dataset showing their relationships. The purpose is to efficiently locate groupings of variables that are highly correlated. The length of the lines in the dendrogram provides a visual indication of the degree of correlation; for example, shorter lines indicate more tightly correlated variables (Figure 2).
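The article does not name the tooling behind Figure 2. The following sketch shows one way such a correlation dendrogram could be produced with SciPy, assuming the six biomechanical attributes sit in a pandas DataFrame with the column names used in Table 2 (the file name is hypothetical).

```python
import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

df = pd.read_csv("vertebral_column.csv")  # hypothetical file holding the raw attributes
cols = ["pelvic_incidence", "pelvic_tilt", "lumbar_lordosis_angle",
        "sacral_slope", "pelvic_radius", "degree_spondylolisthesis"]

corr = df[cols].corr(method="pearson")        # Pearson correlation matrix (Table 2)
dist = 1.0 - corr.abs()                       # turn correlation into a distance
link = linkage(squareform(dist.values, checks=False), method="average")

# Shorter branches join variables that are more tightly correlated (Figure 2).
dendrogram(link, labels=cols, leaf_rotation=90)
plt.tight_layout()
plt.show()
```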
The feature engineering and data replication method
We developed a method which we termed Feature Bayes. This method makes use of a probabilistic model combined with synthetic data creation. Additionally, the data were feature engineered and further refined through automated feature selection. In order to maximize prediction accuracy we generated 54 additional features. We define a row vector $A = [a_1\ a_2\ \dots\ a_6]$ using the original six features from the vertebral column dataset, and $N$ denotes the number of terms.
The features were constructed as follows (a code sketch consolidating a representative subset of them appears after the list).
'Trim mean 80%' calculates the mean taken by excluding a percentage of data points from the top and bottom tails of a vector, as such
$\bar{x}_{A} = \frac{1}{N'} \sum_{j \in S} a_{j}$, where $S$ indexes the elements of $A$ remaining after the top and bottom 10% are excluded and $N' = |S|$ (1)
'Entropy', from information theory, is the expected value of the information contained in each message received [5] and is generally constructed as
$H(A) = -\sum_{n=1}^{6} a_{n} \log_{2} a_{n}$ (2)
'Range' is the span of variation between the upper and lower limits and is generally defined as
$A_{\max} - A_{\min}$ (3)
We developed 'Standard Deviation of A' as a quantity calculated to indicate the extent of deviation for the group as a whole,
$\sigma = \sqrt{\frac{\sum (X - \bar{x})^{2}}{n}}$ (4)
                Pelvic_Incidence  Pelvic_Tilt  Lumbar_Lordosis_Angle  Sacral_Slope  Pelvic_Radius  Degree_Spondylolisthesis
Minimum         26.15             -6.555       14                     13.37         70.08          -11.058
1st quartile    45.7              10.759       36.64                  33.11         110.66         1.474
Median          59.6              16.481       49.78                  42.65         118.15         10.432
Mean            60.96             17.916       52.28                  43.04         117.54         27.525
3rd quartile    74.01             21.936       63.31                  52.55         125.16         42.81
Maximum         129.83            49.432       125.74                 121.43        157.85         418.543
Class counts: Abnormal 145, Normal 72
Table 1: Descriptive statistics of sample data.
Correlation summary using the 'Pearson' covariance
pelvic_radius pelvic_tilt degree_spondylolisthesis lumbar_lordosis_angle sacral_slope pelvic_incidence
pelvic_radius 1 0.01917945 -0.04701219 -0.04345604 -0.34769211 -0.2586922
pelvic_tilt 0.01917945 1 0.37008759 0.45104586 0.04615349 0.6307171
degree_spondylolisthesis -0.04701219 0.37008759 1 0.50847068 0.55060557 0.6478843
lumbar_lordosis_angle -0.04345604 0.45104586 0.50847068 1 0.53161132 0.6812879
sacral_slope -0.34769211 0.04615349 0.55060557 0.53161132 1 0.8042957
pelvic_incidence -0.25869222 0.63071714 0.64788429 0.68128788 0.80429566 1
*Note that only correlations between numeric variables are reported
Table 2: Pearson correlation matrix (Sample)
Citation: Mingle D (2015) A Discriminative Feature Space for Detecting and Recognizing Pathologies of the Vertebral Column. Biomedical Data
Mining 4: 114. doi:10.4172/2090-4924.1000114
Page 3 of 7
Volume 4 • Issue 1 • 1000114
Biomedical Data Mining
ISSN: 2090-4924 JBDM, an open access journal
'Cosine of A' was generated to capture the trigonometric function equal to the ratio of the side adjacent to an acute angle to the hypotenuse,
$\cos A$ (5)
'Tangent of A' was generated to capture the trigonometric function equal to the ratio of the opposite side to the adjacent side in a right triangle,
$\tan A$ (6)
'Sine of A' was generated to capture the trigonometric function equal to the ratio of the side opposite a given angle to the hypotenuse,
$\sin A$ (7)
‘25th Percentile of A’ is the value of vector A such that 25% of the
relevant population is below that value,
$25\text{th Percentile} = \left\lceil \frac{25}{100} \times N \right\rceil$ (8)
Figure 1: Distribution of Biomechanical Features in class.
Figure 2: Hierarchical dendrogram of vertebral column (Sample).
Citation: Mingle D (2015) A Discriminative Feature Space for Detecting and Recognizing Pathologies of the Vertebral Column. Biomedical Data
Mining 4: 114. doi:10.4172/2090-4924.1000114
Page 4 of 7
Volume 4 • Issue 1 • 1000114
Biomedical Data Mining
ISSN: 2090-4924 JBDM, an open access journal
'20th Percentile of A' is the value of vector A such that 20% of the relevant population is below that value,
$20\text{th Percentile} = \left\lceil \frac{20}{100} \times N \right\rceil$ (9)
'75th Percentile of A' is the value of vector A such that 75% of the relevant population is below that value,
$75\text{th Percentile} = \left\lceil \frac{75}{100} \times N \right\rceil$ (10)
'80th Percentile of A' is the value of vector A such that 80% of the relevant population is below that value,
$80\text{th Percentile} = \left\lceil \frac{80}{100} \times N \right\rceil$ (11)
'Pelvic Incidence Squared' was used to change the pelvic incidence from a single dimension into an area; many physical quantities are integrals of some other quantity,
$a_{1}^{2}$ (12)
For each element of the row vector A we performed a square root calculation, which yields a definite quantity when multiplied by itself,
$\sqrt{a_{ij}}$ (13)
For each element of the row vector A we created a 'Natural Log of $a_{ij}$', specifically a logarithm to base $e$,
$\ln a_{ij}$ (14)
'Sum of pelvic incidence and pelvic tilt',
$a_{1} + a_{2}$ (15)
For each element of the row vector A we created a 'Cubed' value of $a_{ij}$,
$a_{ij}^{3}$ (16)
'Difference of pelvic incidence and pelvic tilt',
$a_{1} - a_{2}$ (17)
'Product of pelvic incidence and pelvic tilt',
$a_{1} \times a_{2}$ (18)
'Sum of pelvic tilt and lumbar lordosis angle',
$a_{2} + a_{3}$ (19)
'Sum of lumbar lordosis angle and sacral slope',
$a_{3} + a_{4}$ (20)
'Sum of pelvic radius and degree spondylolisthesis',
$a_{5} + a_{6}$ (21)
'Difference of pelvic tilt and lumbar lordosis angle',
$a_{2} - a_{3}$ (22)
'Difference of lumbar lordosis angle and sacral slope',
$a_{3} - a_{4}$ (23)
'Difference of sacral slope and pelvic radius',
$a_{4} - a_{5}$ (24)
'Difference of pelvic radius and degree spondylolisthesis',
$a_{5} - a_{6}$ (25)
'Quotient of pelvic tilt and pelvic incidence',
$a_{2} / a_{1}$ (26)
'Quotient of lumbar lordosis angle and pelvic tilt',
$a_{3} / a_{2}$ (27)
'Quotient of sacral slope and lumbar lordosis angle',
$a_{4} / a_{3}$ (28)
'Quotient of pelvic radius and sacral slope',
$a_{5} / a_{4}$ (29)
'Quotient of degree spondylolisthesis and pelvic radius',
$a_{6} / a_{5}$ (30)
'Sum of elements A',
$\sum_{n=1}^{6} a_{n}$ (31)
'Average of A elements',
$\bar{x}_{A} = \frac{1}{6} \sum_{n=1}^{6} a_{n}$ (32)
'Median of A elements',
$\text{Median} = \frac{\left(\frac{n}{2}\right)^{\text{th}} \text{term} + \left(\frac{n}{2}+1\right)^{\text{th}} \text{term}}{2}$ (33)
'Euler's number raised to the power of $a_{ij}$',
$e^{a_{ij}}$ (34)
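As referenced above, the sketch below consolidates a representative subset of these per-patient constructions, assuming the six attributes are held in a NumPy array in the order PI, PT, LLA, SS, PR, GOS. The exact trimming and percentile conventions are not specified in the text, so SciPy and NumPy defaults stand in for them, and the function and key names are ours.

```python
import numpy as np
from scipy import stats

def engineer_features(A):
    """Return a dict of engineered features for one patient.

    A : 1-D array [PI, PT, LLA, SS, PR, GOS], the six original attributes.
    Only a representative subset of the 54 constructed features is shown;
    trimming/percentile conventions are assumed, not taken from the paper.
    """
    A = np.asarray(A, dtype=float)
    feats = {
        "trim_mean_80": stats.trim_mean(A, 0.10),   # drop top/bottom 10% (Eq. 1)
        "range": A.max() - A.min(),                 # Eq. 3
        "std": A.std(),                             # Eq. 4 (population form)
        "pctl_25": np.percentile(A, 25),            # Eq. 8
        "pctl_80": np.percentile(A, 80),            # Eq. 11
        "pi_squared": A[0] ** 2,                    # Eq. 12
        "sum_pi_pt": A[0] + A[1],                   # Eq. 15
        "diff_pi_pt": A[0] - A[1],                  # Eq. 17
        "prod_pi_pt": A[0] * A[1],                  # Eq. 18
        "quot_pt_pi": A[1] / A[0],                  # Eq. 26
        "sum_all": A.sum(),                         # Eq. 31
        "mean_all": A.mean(),                       # Eq. 32
        "median_all": np.median(A),                 # Eq. 33
    }
    # Element-wise transforms (Eqs. 5-7, 13-14, 16, 34) for each attribute.
    names = ["PI", "PT", "LLA", "SS", "PR", "GOS"]
    for name, a in zip(names, A):
        feats[f"cos_{name}"] = np.cos(a)
        feats[f"cube_{name}"] = a ** 3
        feats[f"sqrt_{name}"] = np.sqrt(abs(a))     # abs() guards negative PT/GOS values
        feats[f"exp_{name}"] = np.exp(a)
    return feats

print(engineer_features([63.0, 22.5, 45.0, 40.5, 98.6, -0.25]))
```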
Patient data generated with oversampling
The category 'Normal' was significantly underrepresented in the dataset. We employed the synthetic minority oversampling technique (SMOTE) [6]. We chose the class value 'Normal' to work with, using five nearest neighbors to construct an additional 100 instances.
Algorithm SMOTE (T,N,k)
Input: Number of minority class samples T; Amount of SMOTE
N%; Number of nearest neighbors k
Output: (N/100) *T synthetic minority class samples
1. (* If N is less than 100%, randomize the minority class samples as
only a random percent of them will be SMOTEd*)
2. If N<100
3. then Randomize the T minority class samples
4. T=(N/100) * T
5. N=100
6. end if
7. N=(int)(N/100) (*The amount of SMOTE is assumed to be in integral multiples of 100.*)
8. k=Number of nearest neighbors
9. numattrs=Number of attributes
10. Sample[][]: array for original minority class samples
11. newindex: keeps a count of number of synthetic samples generated,
initialized to 0
12. Synthetic[][]: array for synthetic samples (*Compute k nearest
neighbors for each minority class sample only.*)
13. for i ← 1 to T
14. Compute k nearest neighbors for i, and save the indices in nnarray
15. Populate(N, i, nnarray)
16. end for
Populate(N, i, nnarray) (*Function to generate the synthetic samples*)
17. while N ≠ 0
18. Choose a random number between 1 and k, call it nn. This step chooses one of the k nearest neighbors of i.
19. for attr ← 1 to numattrs
20. Compute: dif=Sample[nnarray[nn]] [attr] – Sample[i] [attr]
21. Compute: gap=random number between 0 and 1
22. Synthetic[newindex][attr]=Sample[i][attr] + gap * dif
23. end for
24. newindex++
25. N=N – 1
26. end while
27. Return (*End of Populate*)
End of Pseudo-Code.
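A runnable Python rendering of the pseudocode above is sketched here, assuming N is an integral multiple of 100 (the N < 100 branch is omitted) and using Euclidean distance for the nearest-neighbor step; names follow the pseudocode where possible.

```python
import numpy as np

def smote(sample, N=100, k=5, rng=None):
    """Generate (N/100)*T synthetic minority samples (Chawla et al. [6]).

    sample : (T, numattrs) array of minority-class rows ('Normal' here).
    N      : amount of SMOTE in percent, assumed an integral multiple of 100.
    k      : number of nearest neighbours used for interpolation.
    """
    rng = np.random.default_rng(rng)
    sample = np.asarray(sample, dtype=float)
    T, numattrs = sample.shape
    per_sample = N // 100
    synthetic = np.empty((per_sample * T, numattrs))
    newindex = 0
    # Pairwise Euclidean distances; each row's k nearest neighbours (excluding itself).
    dists = np.linalg.norm(sample[:, None, :] - sample[None, :, :], axis=2)
    nnarray = np.argsort(dists, axis=1)[:, 1:k + 1]
    for i in range(T):
        for _ in range(per_sample):
            nn = rng.integers(k)                 # pick one of the k neighbours
            dif = sample[nnarray[i, nn]] - sample[i]
            gap = rng.random()                   # interpolate a random fraction along the segment
            synthetic[newindex] = sample[i] + gap * dif
            newindex += 1
    return synthetic

# Example: 100% SMOTE produces one synthetic row per original minority row.
minority = np.random.default_rng(0).normal(size=(10, 6))
print(smote(minority, N=100, k=5, rng=1).shape)   # -> (10, 6)
```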
Variance captured while increasing feature space
In an effort to reduce the dimensionality further, we opted to use principal components analysis (PCA) and to choose enough eigenvectors to account for 0.95 of the variance of the sub-selected attributes [7]. We decided to standardize the data rather than merely center it, which allows the PCA to be computed from the correlation matrix rather than the covariance matrix. The maximum number of attributes to include through this transformation was 10. We then chose 0.95 for the value of variance covered, which allowed us to retain enough principal components to account for the appropriate proportion of variance. At the completion of this process we retained 288 components.
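The paper does not name its PCA implementation; a minimal scikit-learn sketch of the standardize-then-retain-95%-variance step is shown below, with random stand-in data in place of the engineered feature matrix.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

def pca_95(X):
    """Standardize X, then keep the smallest number of principal components
    whose cumulative explained variance reaches 0.95.

    Standardizing (rather than merely centering) makes the PCA operate on the
    correlation matrix instead of the covariance matrix, as described in the text.
    """
    Z = StandardScaler().fit_transform(X)   # zero mean, unit variance per column
    pca = PCA(n_components=0.95)            # a fraction requests that share of variance
    scores = pca.fit_transform(Z)
    return scores, pca.explained_variance_ratio_

# Random stand-in data shaped like a 60-feature engineered matrix for 310 patients.
X = np.random.default_rng(0).normal(size=(310, 60))
scores, ratios = pca_95(X)
print(scores.shape, round(ratios.sum(), 3))   # components kept, variance covered
```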
Automated feature selection methods
We utilized a supervised method to select features: a correlation-based feature subset selection evaluator [7]. This method of evaluation assesses the value of a subset of features by analyzing the individual predictive ability of each feature along with the degree of redundancy between them. The preference is for subsets of features that are highly correlated with the class while having low inter-correlation among themselves. Furthermore, we required that the algorithm iteratively add the feature most highly correlated with the class, provided there was not an existing feature in the subset with a higher correlation to the feature being analyzed. We searched the space of feature subsets using greedy hill climbing augmented with backtracking, where the backtracking was governed by the number of consecutive non-improving nodes. We set the direction of the search by starting with the empty set of attributes and searching forward, and we specified five as the number of consecutive non-improving nodes to allow before terminating the search. This method selected 19 attributes from the 60 features. Of those 19 features, only PT and GOS are original data inputs, representing approximately 11%; the other 89% are feature engineered (Table 3).
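A compact sketch of this kind of correlation-based forward selection is shown below. The merit function is the standard CFS formula from [7] and the five-step patience follows the text, but the code is our illustration rather than the exact tool the author used; the data here are random stand-ins.

```python
import numpy as np

def cfs_merit(X, y, subset):
    """Hall's CFS merit: high feature-class correlation, low inter-correlation."""
    k = len(subset)
    r_cf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
    if k == 1:
        r_ff = 0.0
    else:
        r_ff = np.mean([abs(np.corrcoef(X[:, a], X[:, b])[0, 1])
                        for i, a in enumerate(subset) for b in subset[i + 1:]])
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

def greedy_forward_cfs(X, y, patience=5):
    """Forward search from the empty set; stop after `patience` consecutive
    additions that fail to improve the best merit, and return that best subset."""
    selected, remaining = [], list(range(X.shape[1]))
    best_subset, best_merit, stall = [], -np.inf, 0
    while remaining and stall < patience:
        merit, j = max((cfs_merit(X, y, selected + [j]), j) for j in remaining)
        selected.append(j)
        remaining.remove(j)
        if merit > best_merit:
            best_subset, best_merit, stall = list(selected), merit, 0
        else:
            stall += 1
    return best_subset, best_merit

# Example on random stand-in data (binary class encoded as 0/1).
rng = np.random.default_rng(0)
X = rng.normal(size=(310, 60))
y = rng.integers(0, 2, size=310)
print(greedy_forward_cfs(X, y))
```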
Evaluation and classifier
We used receiver operating characteristic (ROC) curves, which compare the false positive rate to the true positive rate. These allow us to assess the trade-off between the number of observations that are incorrectly classified as positive and the number of observations that are correctly classified as positive.
The area under the curve (AUC) summarizes the ROC curve, while accuracy is the total proportion of predictions that were correct,
Accuracy = (True Positive + True Negative) / (True Positive + False Negative + False Positive + True Negative)
The misclassification rate, or error rate, is defined as: Error rate = 1 - Accuracy
We use other metrics in conjunction with the error rate to help
guide the evaluation process,
namely Recall, Precision, False Positive Rate, True Positive Rate,
False Negative Rate, and F-Measure [8].
Recall is the sensitivity, or true positive rate, and is the ratio of positive cases that are correctly identified,
Recall = True Positive / (True Positive + False Negative)
The False Positive Rate is defined as the ratio of negative cases that were incorrectly classified as positive,
False Positive Rate = False Positive / (False Positive + True Negative)
The True Negative Rate, or specificity, is defined as the ratio of negative cases that were classified correctly,
Number of Folds (%)   Attribute
10                    80th Percentile of A
10                    Product of PI and PT
10                    Sum of PR and GOS
10                    PR Cubed
10                    e^(PT)
10                    e^(PR)
10                    e^(GOS)
30                    PT
30                    25th Percentile of A
60                    Quotient of PT and PI
70                    Square root of PT
90                    GOS
90                    Range of elements in A
100                   Standard Deviation of elements A
100                   20th Percentile of A
100                   Sum of PR and GOS
100                   Difference of PR and GOS
100                   Quotient of PR and GOS
100                   GOS Cubed
Table 3: Attributes selected (evaluation mode: 10-fold cross-validation).
True Negative Rate = True Negative / (False Positive + True Negative)
The False Negative Rate is the proportion of positive cases that were incorrectly classified as negative,
False Negative Rate = False Negative / (True Positive + False Negative)
Precision is the ratio of predicted positive cases that were classified correctly,
Precision = True Positive / (True Positive + False Positive)
F-Measure is computed using the harmonic mean and balances the information retrieval precision and recall metrics; the higher the F-Measure value, the higher the classification quality,
F-Measure = 2 × (Precision × Recall) / (Precision + Recall)
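For concreteness, the sketch below computes each of these measures from the four cells of a 2x2 confusion matrix; the function name and the illustrative counts are ours, not results from the study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Evaluation measures used in this study, computed from a 2x2 confusion matrix."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    recall = tp / (tp + fn)                    # sensitivity / true positive rate
    precision = tp / (tp + fp)
    return {
        "accuracy": accuracy,
        "error_rate": 1 - accuracy,
        "recall": recall,
        "false_positive_rate": fp / (fp + tn),
        "true_negative_rate": tn / (fp + tn),  # specificity
        "false_negative_rate": fn / (tp + fn),
        "precision": precision,
        "f_measure": 2 * precision * recall / (precision + recall),
    }

# Purely illustrative counts for a two-class screening problem.
print(classification_metrics(tp=135, fp=5, tn=67, fn=10))
```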
We simplified the classification task by using a Naïve Bayes classifier, which assumes the attributes have independent distributions given the class, and thereby estimates
P(d | c_j) = p(d_1 | c_j) × p(d_2 | c_j) × … × p(d_n | c_j)
Essentially this determines the probability of generating instance d given class c_j. The Naïve Bayes classifier is often represented as a graph in which each class causes certain features with a certain probability [9] (Figure 3).
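The text does not state which Naïve Bayes variant or library was used; the sketch below trains a Gaussian Naïve Bayes model on a 40% split, mirroring the setup reported in Table 4, with random stand-in data in place of the Feature Bayes matrix.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report, roc_auc_score

# X would be the engineered-and-selected feature matrix and y the class labels;
# random stand-in data is used here so the sketch runs on its own.
rng = np.random.default_rng(0)
X = rng.normal(size=(310, 19))
y = np.array(["Abnormal"] * 210 + ["Normal"] * 100)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.40,
                                          stratify=y, random_state=0)
clf = GaussianNB().fit(X_tr, y_tr)          # class-conditional Gaussian per feature
pred = clf.predict(X_te)
proba_normal = clf.predict_proba(X_te)[:, list(clf.classes_).index("Normal")]

print(classification_report(y_te, pred))
print("ROC AUC:", roc_auc_score(y_te == "Normal", proba_normal))
```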
In order to emphasize the benefits of incorporating feature engineering, feature selection, and PCA, we referenced prior research using two standard learning models and the rejoSVM classifier [2]. All training and testing were uniformly applied as before.
Furthermore, we abandoned SVM as a base and instead chose to show the value of incorporating our methods within a simple Naïve Bayes algorithm [10-13]. Moreover, methods such as Feature Bayes may be used as a decision support tool for healthcare providers, particularly for those providers that have minimal resources or limited access to an ongoing professional peer network [14-16] (Tables 4 and 5).
Methods that produce high true positives and low false positives are ideal for medical settings. These allow healthcare providers to have a higher degree of confidence in the diagnoses provided to patients [17,18]. Given a small dataset, which is typical of biomedical datasets, Feature Bayes helps to maximize the predictive accuracy, which could benefit the medical expert in future patient evaluations [19,20] (Table 6).
Conclusion
The analysis of the vertebral column data allowed us to incorporate feature engineering, feature selection, and model evaluation techniques. With these methods, we were able to provide a more accurate way of classifying pathologies. The Feature Bayes method proved to be valuable, obtaining higher true positives and lower false positives than traditional or more current methods such as rejoSVM. This makes it a useful biomedical screening tool to aid healthcare providers with their medical decisions. Further studies should be developed surrounding the analysis of the Feature Bayes method. Moreover, a comparison of ensemble learning techniques using Feature Bayes could prove beneficial.
References
1. Aliferis CF, Statnikov A, Tsamardinos I, Mani S, Koutsoukos XD (2010) Local causal and Markov blanket induction for causal discovery and feature selection for classification, Part I: Algorithms and empirical evaluation. The Journal of Machine Learning Research 11: 171-234.
2. da Rocha Neto AR, Sousa R, Barreto GDA, Cardoso JS (2011) Diagnostic
of pathology on the vertebral column with embedded reject option. Pattern
Recognition and Image Analysis 6669: 588-595.
3. Berthonnaud E, Dimnet J, Roussouly P, Labelle H (2005) Analysis of the
sagittal balance of the spine and pelvis using shape and orientation parameters.
Journal of spinal disorders & techniques 18: 40-47.
4. Aghagolzadeh M, Soltanian-Zadeh H, Araabi B, Aghagolzadeh A (2007) A
hierarchical clustering based on mutual information maximization. Image
Processing 1: I 277- I 280.
5. Nguyen XV, Chan J, Romano S, Bailey J (2014) Effective global approaches
for mutual information based feature selection. In Proceedings of the 20th ACM
SIGKDD international conference on Knowledge discovery and data mining,
ACM.
6. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic
TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
0.855     0.115     0.883       0.855    0.869       0.935      Abnormal
0.85      0.145     0.857       0.885    0.871       0.935      Normal
0.87      0.13      0.87        0.87     0.87        0.935      Weighted Avg.
Table 4: Detailed accuracy by class (40% Train).

TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
0.894     0.029     0.977       0.894    0.933       0.985      Abnormal
0.971     0.106     0.872       0.971    0.919       0.985      Normal
0.927     0.062     0.932       0.927    0.927       0.985      Weighted Avg.
Table 5: Detailed accuracy by class (80% Train).
Training Size   Method                                Accuracy
40%             SVM (linear)                          85
40%             SVM (KMOD)                            83.9
40%             rejoSVM (wr=0.04)                     96.5
40%             Naïve Bayes (6-original data)         87.7
40%             Naïve Bayes (60-transformed data)     81.8
40%             Feature Bayes                         93.5
80%             SVM (linear)                          84.3
80%             SVM (KMOD)                            85.9
80%             rejoSVM (wr=0.04)                     96.9
80%             Naïve Bayes (6-original data)         81.5
80%             Naïve Bayes (60-transformed data)     77.2
80%             Feature Bayes                         98.5
Table 6: Comparison of the performance of different methods.
Figure 3: Naïve Bayes Classifier.
minority over-sampling technique. Journal of Artificial Intelligence Research 16: 321-357.
7. Hall MA (1999) Correlation-based feature selection for machine learning. PhD Thesis, The University of Waikato.
8. Powers DMW (2011) Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation. Journal of Machine Learning Technologies 2: 37-63.
9. Zhang H (2004) The optimality of naive Bayes. AA 1: 3.
10. Alba E, García-Nieto J, Jourdan L, Talbi EG (2007) Gene selection in cancer classification using PSO/SVM and GA/SVM hybrid algorithms. Evolutionary Computation.
11. Bermingham ML, Pong-Wong R, Spiliopoulou A, Hayward C, Rudan I, et al. (2015) Application of high-dimensional feature selection: evaluation for genomic prediction in man. Scientific Reports 5: 10312.
12. Brown G, Pocock A, Zhao MJ, Luján M (2012) Conditional likelihood
maximisation: a unifying framework for information theoretic feature selection.
The Journal of Machine Learning Research 13: 27-66.
13. Hand DJ, Yu K (2001) Idiot's Bayes—not so stupid after all? International
statistical review 69: 385-398.
14. Jordan A (2001) On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. Advances in Neural Information Processing Systems 14: 841-848.
15. López FG, Torres MG, Batista BM, Pérez JAM, Moreno-Vega JM (2006)
Solving feature subset selection problem by a parallel scatter search. European
Journal of Operational Research 169: 477-489.
16. Murty MN, Devi VS (2011) Pattern recognition: An algorithmic approach.
17. Neto ARR, Barreto GA (2009) On the application of ensembles of classifiers to the diagnosis of pathologies of the vertebral column: A comparative analysis. Latin America Transactions, IEEE (Revista IEEE America Latina) 7: 487-496.
18. Rennie JD, Shih L, Teevan J, Karger DR (2003) Tackling the poor assumptions of naive Bayes text classifiers. ICML 3: 616-623.
19. Rish I (2001) An empirical study of the naive Bayes classifier. In IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence, IBM, New York.
20. Yamada M, Jitkrittum W, Sigal L, Xing EP, Sugiyama M (2014) High dimensional
feature selection by feature-wise kernelized lasso. Neural computation 26: 185-
207.