
Uncovering Discriminative Knowledge-Guided Medical Concepts for Classifying Coronary Artery Disease Notes: 31st Australasian Joint Conference, Wellington, New Zealand, December 11-14, 2018, Proceedings


Mahdi Abdollahi1, Xiaoying Gao1, Yi Mei1, Shameek Ghosh2, and Jinyan Li3
1Victoria University of Wellington, Wellington, New Zealand
{mahdi.abdollahi, xiaoying.gao, yi.mei}
2Medius Health, Sydney, Australia
3University of Technology Sydney, Sydney, Australia
Abstract. Text classification is the challenging task of allocating each
document to the correct predefined class. Irrelevant features often
introduce noise in the learning step and reduce the precision of
prediction. Hence, more efficient methods are needed to select or extract
meaningful features and to avoid noise and overfitting. In this work, an
ontology-guided method utilizing the taxonomical structure of the Unified
Medical Language System (UMLS) is proposed. The method extracts, as
features, the concepts of the phrases appearing in the documents that
relate to diseases or symptoms. The method is evaluated on the 2010
Informatics for Integrating Biology and the Bedside (i2b2) data set. The
experimental results show that the proposed ontology-based method
significantly improves the accuracy of classification.
Keywords: Coronary artery disease notes · Text classification ·
Feature selection · Conceptualization · Ontology
1 Introduction
This paper proposes a method that uses the Unified Medical Language System
(UMLS) [1] as an ontology for entity recognition and then aggregates
frequent entities to create features. The proposed method is integrated with
five common text classification methods to answer the following research
questions:
1. Whether the proposed method can reduce the number of features and keep
the meaningful features; and
2. Whether the proposed method can increase the accuracy in classification of
the targeted clinical text.
Springer Nature Switzerland AG 2018
T. Mitrovic et al. (Eds.): AI 2018, LNAI 11320, pp. 104–110, 2018.
A review of previous work shows that most disease-targeted systems are
static rule-based systems that require human intervention every time the
model is updated with new features. Such systems do not scale for practical
machine learning purposes. Our system allows an easier and more flexible
selection of different types of medical concepts, which enables automatic
extraction of features or feature combinations and generation of a
prediction model.
2 Proposed Ontology Based Approach
An important point in text classification problems is to investigate the
domain of the documents to be classified and the domain of the classes with
which the documents are labeled. This helps to select, for the training
phase, only those features of the documents that are related to the domain,
and to improve the prediction accuracy for unseen documents. In the clinical
text classification task, all of the documents are discharge notes of
patients in the medical domain. The candidate class is whether a disease
such as Coronary Artery Disease (CAD) is present or not. Our goal is to
select features that are related to the disease, so that the performance of
the learned model can be improved.
To achieve this goal, our proposed algorithm employs the 2010 Informatics
for Integrating Biology and the Bedside (i2b2) data set [2] and the UMLS
library. The MetaMap tool is used to extract, through the UMLS, all the
concepts of the phrases in each document. As shown in Fig. 1, the concept
extraction step is applied to both the training and the test documents.
Then, considering the medical domain, the concept selection step is
performed on the obtained concepts. As a first step, two concepts are
selected among all the concepts: “Disease or Syndrome” and “Sign or
Symptom”. Selecting concepts in this way keeps the meaningful ones, which
helps the training phase learn better and increases the accuracy of
classification.
2.1 Conceptualization
Two sentences are given below as a sample to show how MetaMap processes the
input notes and what output it provides to the classification process.
“Hyperlipidemia: The patient’s Lipitor was increased to 80 mg q.d. A progress note in the
patient’s chart from her assisted living facility indicates that the patient has had shortness
of breath for one day.”
Figure 2 shows a segment of the results returned by MetaMap. Table 1
summarizes the concepts extracted with MetaMap for the meaningful phrases
detected in the sample sentences. As can be observed, the phrase
“hyperlipidemia” belongs to the “[Disease or Syndrome]” and “[Finding]”
concepts. The phrase “shortness of breath” is allocated to the “[Sign or
Symptom]”, “[Clinical Attribute]” and “[Intellectual Product]” concepts.
Considering the medical domain and the type of the classes in the selected
data set, we choose concepts that appear in the “[Disease or Syndrome]” or
“[Sign or Symptom]” categories. First, we identify these two categories,
which appear in square brackets; then the phrase within the round
parentheses on the same line is extracted as the main phrase. For example,
the phrase “Dyspnea” is extracted in line 19 of Fig. 2 for the phrase
“shortness of breath”. After the concept selection step, the obtained
phrases are used instead of the original documents in the binary
classification problem. To weight the extracted terms of the documents,
TF-IDF is applied in the vectorization step, and each document is
represented as a vector of TF-IDF weights.
Fig. 1. The flowchart of the architecture of using MetaMap and UMLS for text
classification.
Fig. 2. A segment of returned results of extracted concepts using MetaMap.
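The selection rule described above can be sketched in a few lines of Python.
This is a minimal sketch and not the authors' implementation. The assumed
shape of a MetaMap candidate line (score, concept identifier, matched name,
optional preferred name in round parentheses, semantic types in square
brackets), the placeholder concept identifiers, and the names CANDIDATE_LINE
and select_concepts are all illustrative assumptions. Falling back to the
matched name when a line has no parenthesised phrase is also an assumption.

```python
import re

# The two semantic types kept by the concept selection step.
SELECTED_TYPES = {"Disease or Syndrome", "Sign or Symptom"}

# Assumed line shape (hypothetical, for illustration only):
#   "  1000   C0000003:Shortness of Breath (Dyspnea) [Sign or Symptom]"
CANDIDATE_LINE = re.compile(
    r"""^\s*-?\d+\s+                      # candidate score
        C\d+:                             # concept identifier
        (?P<name>[^(\[]+?)\s*             # matched concept name
        (?:\((?P<preferred>[^)]*)\)\s*)?  # optional phrase in round parentheses
        \[(?P<types>[^\]]+)\]\s*$         # semantic types in square brackets
    """,
    re.VERBOSE,
)


def select_concepts(metamap_lines):
    """Return the main phrases whose semantic types include a selected type."""
    selected = []
    for line in metamap_lines:
        match = CANDIDATE_LINE.match(line)
        if match is None:
            continue  # not a candidate line (e.g. a "Phrase:" header)
        types = {t.strip() for t in match.group("types").split(",")}
        if types & SELECTED_TYPES:
            # Prefer the parenthesised phrase; otherwise use the matched name.
            phrase = match.group("preferred") or match.group("name")
            selected.append(phrase.strip())
    return selected


sample = [
    "Phrase: Hyperlipidemia",
    "  1000   C0000001:Hyperlipidemia [Disease or Syndrome]",
    "  1000   C0000002:Hyperlipidemia [Finding]",
    "  1000   C0000003:Shortness of Breath (Dyspnea) [Sign or Symptom]",
]
print(select_concepts(sample))
```

Each document would then be represented by the list of phrases returned for
its MetaMap output rather than by its original words.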
Table 1. The extracted concepts of example sentences using MetaMap.
| Sentences | Detected phrases | Extracted concepts | Selected |
|---|---|---|---|
| First sentence | Hyperlipidaemia | [Disease or Syndrome] | ✓ |
| | | [Finding] | × |
| | Patient | [Patient or Disabled group] | × |
| | Lipitor | [Organic Chemical, Pharmacologic Substance] | × |
| | 80% | [Quantitative Concept] | × |
| | mg++ increased | [Finding] | × |
| Second sentence | Progress note | [Clinical Attribute] | × |
| | | [Intellectual Product] | × |
| | Patient chart | [Manufactured Object] | × |
| | Assisted living facility | [Healthcare Related Organization, Manufactured Object] | × |
| | Patient | [Patient or Disabled group] | × |
| | Shortness of breath | [Sign or Symptom] | ✓ |
| | | [Clinical Attribute] | × |
| | | [Intellectual Product] | × |
| | One day | [Temporal Concept] | × |
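The TF-IDF weighting used in the vectorization step of Sect. 2.1 can be
illustrated as follows. The paper does not state which TF-IDF variant was
used, so this sketch assumes the textbook form, raw term frequency times
log(N / df), where N is the number of documents and df the number of
documents containing the term. The function name tfidf_vectors is
hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Turn lists of concept terms into TF-IDF weight vectors.

    Assumed weighting: tf(term, doc) * log(N / df(term)).
    """
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    vocabulary = sorted(doc_freq)
    vectors = []
    for doc in documents:
        term_freq = Counter(doc)
        vectors.append(
            [term_freq[t] * math.log(n_docs / doc_freq[t]) for t in vocabulary]
        )
    return vocabulary, vectors


docs = [
    ["Dyspnea", "Hyperlipidemia"],
    ["Dyspnea"],
    ["Coronary Artery Disease", "Dyspnea", "Coronary Artery Disease"],
]
vocabulary, vectors = tfidf_vectors(docs)
```

Under this weighting, a concept that occurs in every document (here
"Dyspnea") receives weight zero, so only concepts that discriminate between
documents contribute to the vectors.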
2.2 Data Preprocessing and Labelling
The idea of the paper is tested on the 2010 i2b2 data set. This paper
focuses on binary classification, so all the documents are labeled based on
whether or not Coronary Artery Disease (CAD) is present. Each document in
the original data set has three files, “Concepts.con”, “Relations.rel”, and
“Assertions.ast”, which were provided by the i2b2 organization for the
Relations Challenge. We used the content of the “Assertions.ast” file of
each document to determine its label. As shown in Fig. 3, each Assertions
file contains a number of problem names. To label a document, all the lines
of the file are first searched for the phrase “Coronary Artery Disease”. If
the phrase is found, the second step checks whether the disease is present:
if the name of the illness appears with the phrase “present” on the same
line, the document is assigned to the CAD class. Following this rule, the
labels of all 170 training documents and 256 test documents are extracted.
Fig. 3. A subpart of the Assertions file.
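The labelling rule can be sketched as below. This is a hedged illustration:
the line shape of the assertion file shown in the comment is an assumption
about the i2b2 annotation format, and the names ASSERTION_LINE and has_cad
are hypothetical. The sketch checks the assertion field explicitly instead
of searching for the bare word “present”, which is a slightly stricter
variant of the rule described in the text.

```python
import re

# Assumed line shape (illustrative):
#   c="coronary artery disease" 20:4 20:6||t="problem"||a="present"
ASSERTION_LINE = re.compile(
    r'c="(?P<concept>[^"]*)".*\|\|a="(?P<assertion>[^"]*)"'
)


def has_cad(assertion_lines, disease="coronary artery disease"):
    """Return True if any line asserts the disease as present."""
    for line in assertion_lines:
        match = ASSERTION_LINE.search(line)
        if match is None:
            continue
        mentions_disease = disease in match.group("concept").lower()
        if mentions_disease and match.group("assertion") == "present":
            return True
    return False
```

A document whose assertion file yields True would be placed in the CAD
class; all other documents would form the negative class.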
3 Results and Discussions
The performance of the proposed method is assessed on the 2010 i2b2 data
set. Among all the topics, the CAD class is considered, which forms a binary
classification problem. Five popular classifiers are used in the
experimental comparison: Naive Bayes, Linear Support Vector Machine (SVM),
K-Nearest Neighbor (KNN), Decision Tree and Logistic Regression. The
performance of the classifiers is evaluated with three main metrics
(Precision, Recall, F1-measure) using micro-average and macro-average.
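The difference between the two averaging schemes can be made concrete with a
small Python sketch. Micro-averaging pools the true positives, false
positives and false negatives of all classes before computing the metrics,
so for single-label classification its precision, recall and F1-measure all
equal the accuracy. Macro-averaging computes the metrics per class and then
takes their unweighted mean. The sketch assumes that macro F1 is the mean of
the per-class F1 scores, a convention the paper does not spell out; all
function names are illustrative.

```python
def class_counts(y_true, y_pred, label):
    """Count true positives, false positives and false negatives for a class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    return tp, fp, fn


def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from raw counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def micro_macro(y_true, y_pred):
    """Return (micro, macro) triples of precision, recall and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    counts = [class_counts(y_true, y_pred, c) for c in labels]
    # Micro: sum the counts over classes, then compute the metrics once.
    micro = precision_recall_f1(*(sum(col) for col in zip(*counts)))
    # Macro: compute the metrics per class, then average them.
    per_class = [precision_recall_f1(*c) for c in counts]
    macro = tuple(sum(vals) / len(per_class) for vals in zip(*per_class))
    return micro, macro
```

This is why the three micro-average columns of a classifier carry the same
value, while the macro-average columns differ and are more sensitive to the
minority class.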
Some of the parameters of these classifiers are tuned to get better results.
The number of neighbors in KNN is set to 28 through the “n_neighbors”
parameter. In the Decision Tree classifier, the maximum depth of the tree is
set to 14 (“max_depth”) and the seed of the random number generator to 11
(“random_state”). The inverse of regularization strength in Logistic
Regression is set to “1e1” through the “C” parameter. Furthermore, an early
stopping rule is used to avoid overfitting when training the Linear SVM and
Logistic Regression classifiers. All other classifier parameters keep their
default values.
Table 2 compares the micro-average and macro-average results of the
classifiers without and with MetaMap. The best results are highlighted in
the table. The experimental results show that the accuracy of every
classifier increases after applying the proposed method. In Table 2,
K-Nearest Neighbor using MetaMap achieves the best micro-average
performance (94.86% F1-measure) among the classifiers.
Table 2. The obtained results for the 2010 i2b2 data set.
| Average | Method | Without MetaMap: Precision | Without MetaMap: Recall | Without MetaMap: F1-measure | With MetaMap: Precision | With MetaMap: Recall | With MetaMap: F1-measure |
|---|---|---|---|---|---|---|---|
| Micro | Naive Bayes | 77.47 | 77.47 | 77.47 | 81.42 | 81.42 | 81.42 |
| Micro | Linear SVM | 87.35 | 87.35 | 87.35 | 93.28 | 93.28 | 93.28 |
| Micro | KNN | 84.98 | 84.98 | 84.98 | 94.86 | 94.86 | 94.86 |
| Micro | Decision tree | 85.77 | 85.77 | 85.77 | 90.12 | 90.12 | 90.12 |
| Micro | Logistic regression | 86.96 | 86.96 | 86.96 | 92.89 | 92.89 | 92.89 |
| Macro | Naive Bayes | 50.55 | 50.20 | 48.33 | 68.50 | 62.21 | 64.00 |
| Macro | Linear SVM | 84.44 | 70.66 | 74.67 | 91.07 | 86.28 | 88.41 |
| Macro | KNN | 85.33 | 62.01 | 65.08 | 91.92 | 91.24 | 91.58 |
| Macro | Decision tree | 77.17 | 74.47 | 75.67 | 82.78 | 91.51 | 85.93 |
| Macro | Logistic regression | 86.39 | 68.02 | 72.31 | 92.38 | 83.64 | 87.15 |
Taking the two F1-measure columns of the micro-average results in Table 2 as
the classification accuracy, the Naive Bayes and Decision Tree classifiers
improve by approximately 4% with the proposed method, and Linear SVM and
Logistic Regression improve by about 6%. The biggest improvement is achieved
by K-Nearest Neighbor (about 10%). Overall, the models learned from the
concepts of phrases instead of the original documents achieve on average a
6.1% improvement in classifying the 2010 i2b2 data set. Moreover, the
conceptualization approach reduces the number of features from 7554 to 788,
which is a reduction of about 90%.
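The figures in this paragraph can be rechecked from the micro-average
F1-measure values reported in Table 2. The short Python sketch below only
redoes the arithmetic; the variable names are illustrative. Computed from
the rounded table values, the mean gain comes out at roughly 6.0 percentage
points, close to the reported 6.1%; the small gap may come from rounding or
from a different way of averaging, which the paper does not specify.

```python
# Micro-average F1-measure from Table 2: (without MetaMap, with MetaMap).
micro_f1 = {
    "Naive Bayes": (77.47, 81.42),
    "Linear SVM": (87.35, 93.28),
    "KNN": (84.98, 94.86),
    "Decision Tree": (85.77, 90.12),
    "Logistic Regression": (86.96, 92.89),
}

# Gain in percentage points for each classifier.
gains = {
    name: with_metamap - without_metamap
    for name, (without_metamap, with_metamap) in micro_f1.items()
}
mean_gain = sum(gains.values()) / len(gains)

# Relative reduction in the number of features (7554 before, 788 after).
feature_reduction = 1 - 788 / 7554

for name, gain in gains.items():
    print(f"{name}: +{gain:.2f} points")
print(f"mean gain: {mean_gain:.2f} points")
print(f"feature reduction: {feature_reduction:.1%}")
```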
To further evaluate our approach, we used 10-fold cross validation instead
of the original training-testing split given by the data set. We shuffled
the documents and ran the experiment 30 times, each run being a 10-fold
cross validation, and performed a significance test on the results of the 30
runs. Table 3 details the mean and the standard deviation of the suggested
method with MetaMap and of the method without MetaMap on the i2b2 data set.
The classification accuracy is the average over the 30 runs of 10-fold cross
validation. The Wilcoxon signed-rank test is applied to check whether the
proposed method makes a significant difference in classification accuracy.
In Table 3, the “T” column shows the result of the significance test of the
method without MetaMap against the suggested method, where “+” implies the
proposed technique is significantly more accurate, “=” implies no
significant difference, and “−” implies it is significantly less accurate.
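The repeated cross-validation protocol can be sketched as a generator of
train/test index splits. This is a minimal illustration under stated
assumptions: the way documents are assigned to folds, the seed, and the name
repeated_kfold_indices are not taken from the paper, and the classifier
training, accuracy computation and Wilcoxon test are left out.

```python
import random


def repeated_kfold_indices(n_docs, n_folds=10, n_runs=30, seed=0):
    """Yield (run, fold, train_idx, test_idx) for repeated k-fold CV.

    The documents are reshuffled before every run, and each run splits
    the shuffled order into n_folds disjoint test folds.
    """
    rng = random.Random(seed)
    for run in range(n_runs):
        order = list(range(n_docs))
        rng.shuffle(order)
        for fold in range(n_folds):
            # Every n_folds-th document of the shuffled order forms a fold.
            test_idx = order[fold::n_folds]
            held_out = set(test_idx)
            train_idx = [i for i in order if i not in held_out]
            yield run, fold, train_idx, test_idx
```

With the default arguments the generator produces 300 splits. Averaging the
fold accuracies within each run would give 30 accuracy values per method,
which could then be compared pairwise with a signed-rank test.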
Table 3. Comparison of classification accuracy and standard deviation averages using
30 independent runs. The highlighted entries are significantly better (Wilcoxon Test,
| Dataset | Classifier | Without MetaMap | Highest Mean (Lowest STD) | With MetaMap | Highest Mean (Lowest STD) | T |
|---|---|---|---|---|---|---|
| 2010 i2b2 | Naive Bayes | 80.49 ± 0.055 | 81.34 (0.036) | 84.26 ± 0.053 | 85.64 (0.029) | + |
| | Linear SVM | 88.96 ± 0.046 | 89.49 (0.031) | 92.56 ± 0.038 | 93.08 (0.016) | + |
| | KNN | 86.76 ± 0.051 | 87.80 (0.023) | 91.61 ± 0.039 | 92.82 (0.028) | + |
| | Decision Tree | 90.36 ± 0.037 | 92.60 (0.016) | 89.14 ± 0.042 | 91.39 (0.029) | = |
| | Logistic Regression | 88.51 ± 0.047 | 89.02 (0.027) | 92.63 ± 0.038 | 93.32 (0.021) | + |
From Table 3, it can be concluded that the proposed method achieves
significantly higher classification accuracy than the method without
MetaMap for four of the five classifiers. Only for the Decision Tree
classifier is there no significant difference in classification accuracy.
To analyze the methods further, we checked the outputs and found two
documents, “0101.txt” and “0302.txt”, both labeled CAD, that all classifiers
without MetaMap labeled incorrectly, whereas all classifiers in the proposed
method labeled them correctly. Careful inspection of the documents suggests
two main reasons. First, our method significantly decreases the amount of
noisy data, which helps the classifiers learn better. Second, the new method
maps phrases to their concepts, which are meaningful and most of the time
shorter than the original phrases. When every word in a document stands
alone as a feature, a phrase consisting of more than one word loses its
meaning.
4 Conclusion and Future Work
The current study proposed a medical-ontology-driven feature engineering
approach that reduces the number of features while retaining the meaningful
ones. Using the MetaMap tool, we map meaningful phrases in medical text to
specific UMLS medical concepts. The concepts related to the problem domain
are selected as features. The number of features is reduced significantly by
selecting the “Disease or Syndrome” and “Sign or Symptom” concepts, which
are the most important in the domain of clinical notes. Experimental and
statistical results show that the suggested approach can achieve
significantly better classification accuracy.
In future work, we will consider relations between diseases and symptoms and
include the interconnected ones as pairs [3]. Furthermore, we plan to use
concepts of sentences instead of phrases as features, which may further
reduce the number of features and increase the accuracy. We will also look
for temporal relations between events to increase the classification
accuracy. Finally, all of the suggested ideas will be applied to other data
sets for further analysis.
References
1. Unified Medical Language System (UMLS®). umls/. Last updated 20 April 2016
2. Uzuner, Ö.: Recognizing obesity and comorbidities in sparse data. J. Am. Med. Inf.
Assoc. 16(4), 561–570 (2009)
3. Ernst, P., Siu, A., Weikum, G.: KnowLife: a versatile approach for constructing
a large knowledge graph for biomedical sciences. BMC Bioinform. 16(1), 157–169