
The Role of Text Analytics in Healthcare: A Review of Recent Developments and Applications


Abstract

The implementation of Data Analytics has achieved a significant momentum across a very wide range of domains. Part of that progress is directly linked to the implementation of Text Analytics solutions. Organisations increasingly seek to harness the power of Text Analytics to automate the process of gleaning insights from unstructured textual data. In this respect, this study aims to provide a meeting point for discussing the state-of-the-art applications of Text Analytics in the healthcare domain in particular. It is aimed to explore how healthcare providers could make use of Text Analytics for different purposes and contexts. To this end, the study reviews key studies published over the past 6 years in two major digital libraries including IEEE Xplore, and ScienceDirect. In general, the study provides a selective review that spans a broad spectrum of applications and use cases in healthcare. Further aspects are also discussed, which could help reinforce the utilisation of Text Analytics in the healthcare arena.
Mahmoud Elbattah1, Émilien Arnaud2, Maxime Gignon2 and Gilles Dequen1
1Laboratoire MIS, Université de Picardie Jules Verne, Amiens, France
2Emergency Department, Amiens-Picardy University, Amiens, France
{mahmoud.elbattah, gilles.dequen}, {arnaud.emilien, maxime.gignon}
Keywords: Text Analytics, Natural Language Processing, Unstructured Data, Healthcare Analytics.
“Most of the knowledge in the world in the future
is going to be extracted by machines and will
reside in machines”, (LeCun, 2014).
The above-mentioned statement describes the ever-rising
abundance of data-driven knowledge, which
continuously calls for further utilisation of Machine
Learning (ML). By the same token, healthcare is
delivered in data-rich environments where a broad
variety of data sources can be created at the individual
and population levels. The format of health data
ranges from Electronic Health Records (EHR) to
images, time series, or unstructured textual notes.
Data Analytics has been increasingly considered
as an enabling artefact to leverage health data for
competitive advantage. Using a diversity of ML
techniques, analytics has been widely utilised to
summarise, explain, and get insights into the
interrelationships underlying complex datasets in
novel ways. Such insights can play a positive role in
various medical and operational aspects including
diagnosis, health monitoring and assessment,
healthcare planning, and management of hospitals
and health services.
However, one of the key challenges for healthcare
analytics is to deal with huge data volumes in the form
of unstructured text. Examples include nursing notes,
clinical protocols, medical transcriptions, medical
publications, and many others. In this respect, the use
of Text Analytics has increasingly come into
prominence in order to deliver benefits for health
organisations in a wide range of applications.
Text Analytics, or Text Mining, is generally
defined as the methodology followed to derive quality
and actionable insights from textual data (Sarkar,
2019). Text Analytics represents an overarching field
of techniques and technologies including Natural
Language Processing (NLP), ML, and Information
Retrieval. The power of Text Analytics is to extract
information that could allow for forming and
exploring new facts or hypotheses from unstructured
textual data (Hearst, 1999).
Compared to conventional tasks, the obvious
challenge of Text Analytics is to extract patterns from
natural-language text, rather than well-structured
databases. Textual data are largely stored in an
unstructured form, which does not adhere to any
pre-defined schema or data model. Further, standard ML
algorithms were originally crafted to deal with
numeric data. As such, Text Analytics needs to apply
specially designed techniques and transformations
to effectively operate over textual data.
The potential of NLP has been constantly
discussed in the healthcare literature (e.g.
Demner-Fushman, Chapman, and McDonald, 2009; Jensen,
Jensen, and Brunak, 2012; Spasić, Uzuner, and Zhou,
2020). In this respect, the main motivation for this
study was to explore the recent developments and
applications in this context. The study provides a
selective review that spans a broad spectrum of the
applications and use cases of Text Analytics in the
healthcare domain in particular.
The review aimed to explore the state-of-the-art
approaches and applications of Text Analytics in the
healthcare context. We were generally motivated by
a set of exploratory questions as below:
What are the potential data sources for applying Text Analytics in healthcare?
What are the recent technological advances in implementing Text Analytics in this context?
How could Text Analytics help healthcare providers make better decisions?
What are the challenges of integrating NLP tools into healthcare systems?
What are the key limitations of Text Analytics in the healthcare domain?
The review incorporated two main stages. The
initial stage included the screening and selection of
studies retrieved from the search results.
Subsequently, we analysed a set of representative
studies to be included in the literature review. The
study sought to largely follow the procedures of a
systematic literature review as informed by Booth,
Sutton, and Papaioannou (2011).
The search of literature was conducted to find
relevant studies in two major digital libraries:
i) IEEE Xplore, and ii) ScienceDirect. It is
acknowledged that other relevant studies could have
been published in other conferences or journals, but
we believe that the selected venues generally
provided excellent representative studies. The review
timeframe stretched through the past 6 years (i.e.
The inclusion of studies was conducted over a
three-step process for screening and classifying
studies. First, potential studies were screened based
on the title. Second, the abstracts were
inspected to confirm the suitability for full-text
review. Finally, the decision of inclusion
was made based on the full-text inspection. Figure 1
sketches a flowchart of the review process. Table 1
summarises the search strategy.
Figure 1: The process of screening and selecting
studies in the review.
Table 1. Summary of search strategy.
Digital Libraries: IEEE Xplore, ScienceDirect
Search Terms: Text Analytics Healthcare, Text Mining Healthcare, NLP Healthcare
Search Items: Title, Abstract, Keywords
Types of: Conference Proceedings, Journal Articles
This section aims to provide an analysis of the studies
reviewed. The search results included about 200
publications overall. Eventually, a set of 35 studies
was included in the review based on the process of
screening and analysis as described before.
The review is organised into two broad categories
of Text Analytics. The first part presents
selective studies that applied Text Mining in the
context of healthcare. The second
part describes Text Analytics in a diversity of
predictive applications to support clinical decision
making. The review is unavoidably selective rather
than exhaustive. However, it is believed that the study
could adequately provide representative studies in
each category.
3.1 Text Mining Applications in
Text Mining consists of two phases. The
initial phase typically includes the application of text
refining procedures, which transform free-text
documents into an intermediate form.
The subsequent phase is knowledge extraction,
which attempts to learn patterns or insights from that
intermediate form (Tan, 1999). This section provides
selective studies that applied Text Mining with
different modalities and for various purposes in the
healthcare context.
(Han, Nandan, and Sun, 2015) presented a rule-based
system for question retrieval. The goal was to
search for similar questions in a large corpus of
questions posted on online health forums. The system
was mainly based on the RAKE algorithm (Rose,
Engel, Cramer, and Cowley, 2010) to perform the
automatic extraction of keywords. Additional NLP
methods were applied using the popular NLTK
library (Bird, Klein, and Loper, 2009).
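To make the keyword-extraction step more concrete, the following is a minimal sketch of the RAKE idea in plain Python. It is not the system of Han et al.; the tiny stopword list, the tokenisation pattern, and the sample question are illustrative assumptions. Candidate phrases are cut at stopwords and punctuation, each word is scored by its degree divided by its frequency, and a phrase score is the sum of its word scores.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption; RAKE normally uses a much larger one).
STOPWORDS = {"a", "an", "and", "are", "can", "for", "how", "i", "in", "is",
             "my", "of", "the", "to", "what", "with"}


def candidate_phrases(text):
    """Split text into candidate phrases at punctuation and stopwords."""
    phrases = []
    for fragment in re.split(r"[.,;:!?()\n]", text.lower()):
        current = []
        for word in re.findall(r"[a-z0-9'-]+", fragment):
            if word in STOPWORDS:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases


def rake_keywords(text):
    """Rank phrases by the sum of degree/frequency ratios of their words."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrences, counting the word itself
    scores = {
        " ".join(phrase): sum(degree[w] / freq[w] for w in phrase)
        for phrase in set(phrases)
    }
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


question = "What are the side effects of long term insulin therapy for type 2 diabetes?"
for phrase, score in rake_keywords(question)[:3]:
    print(f"{score:5.1f}  {phrase}")
```

In a question-retrieval setting, the top-ranked phrases of a new question could then be matched against the keywords of stored questions; that matching step is not shown here.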
In another application of Text Mining, a study
aimed to develop automated methods for extracting
information from the application webpages on the
iTunes App Store (Paglialonga, Riboldi, Tognola,
and Caiani, 2017). The study considered around 86K
applications under the categories of Medicine and
Health/Fitness. They used the NLP capabilities
provided by the IBM Watson API to identify the
medical specialty (e.g. cardiology, nutrition,
neurology, etc.), and the type of sponsor (e.g. industry
manufacturer, or government organisation).
Likewise, (Paglialonga et al., 2017) applied Text
Mining to automate the extraction of meaningful
information about health apps on the web.
(Lieder et al., 2019) developed a system that
could mine millions of public business webpages to
extract a multi-faceted representation of customers. In
addition, the extracted data were enriched with
external information collected from Wikipedia. In
this respect, a large-scale knowledge graph was
constructed including millions of inter-connected
entities, which could be continuously enriched and
connected to new entities. The system could be
applied to industry use cases, such as healthcare, to
support insight discovery in real time.
In addition, several studies applied Text Mining to
extract information or insights from online forums or
discussions. For instance, (Sutar, 2017) presented an
interesting application of Text Mining to extract
healthcare-related information from the
user-generated content on social media. Using a dataset
from a cancer-related forum, they developed a system
that could be used to extract practical information
such as treatments, medication names, and side
effects. The dataset included a set of unstructured and
semi-structured textual fields. Similarly, (Deng, Zhou,
Zhang, and Abbasi, 2019) proposed a framework to
support the analytics of online discussions. The
framework was named Discussion Logic-based
Text Analytics (DiLTA). The DiLTA framework
attempted to extract features that could reveal the
discussion logic underlying online forums. The
framework was evaluated using a case study
related to healthcare forums.
(Martínez et al., 2016) discussed turning
health-related online content into actionable
knowledge using Text Mining. To this end, they
developed an approach to help monitor online
user-generated streams on social media. An NLP-based
processing pipeline was applied to extract and
transform information stemming from real-time
streams of social media. The system could not only
extract the mentions of diseases and drugs, but also
identify useful relationships among
medications, indications, and adverse drug reactions.
(James, Calderon, and Cook, 2017) analysed
unstructured textual feedback on physicians. They
aimed to extract sentiments and topics pertaining to
the quality of healthcare service. Specifically, they
attempted to identify the tones and topics that could
shape the service ratings. In this regard, more than
20K patient reviews of more than about 4K
physicians were analysed using the Latent Dirichlet
Allocation (LDA) method. Further, a dictionary-based
text analysis was applied to determine the tone
elements in the physician reviews.
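As a rough illustration of what a dictionary-based tone analysis involves, the sketch below counts lexicon hits per tone category in a review. The lexicon, the category names, and the `net_tone` ratio are hypothetical choices made for this example; the dictionary actually used by James et al. is not described here.

```python
import re
from collections import Counter

# Hypothetical tone lexicon; a real study would rely on a validated dictionary.
TONE_LEXICON = {
    "positive": {"caring", "helpful", "friendly", "thorough", "excellent"},
    "negative": {"rude", "rushed", "late", "dismissive", "painful"},
}


def tone_counts(review):
    """Count how many tokens of a review fall into each tone category."""
    tokens = re.findall(r"[a-z']+", review.lower())
    counts = Counter()
    for tone, words in TONE_LEXICON.items():
        counts[tone] = sum(1 for token in tokens if token in words)
    return counts


def net_tone(review):
    """Return (positive - negative) / matched words, or 0.0 when nothing matches."""
    counts = tone_counts(review)
    matched = counts["positive"] + counts["negative"]
    if matched == 0:
        return 0.0
    return (counts["positive"] - counts["negative"]) / matched


review = "The doctor was caring and thorough, but the visit felt rushed."
print(dict(tone_counts(review)), round(net_tone(review), 3))
```

Such counts are simple to interpret, which is one reason dictionary methods are often paired with topic models such as LDA rather than replaced by them.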
(Pendyala, and Figueira, 2017) explored the
potential of Text Mining for automating medical
diagnosis. Their study applied the Bag-of-Words
representation to medical documents. To simplify the
text representation, the Bag-of-Words model builds a
histogram of the words, where each word count is
considered as a feature (Goldberg, 2017). As such,
each document can be simply represented as a “bag
of words”, while disregarding the order, sequence, and
grammar of text. Though using a small dataset, their
experiments demonstrated promising results for that
application. More recently, (van Dijk et al., 2020)
applied Text Mining to EHR data to validate the
screening eligibility of trial patients. The study was
multi-centre and involved multiple EHR systems as
well. The accuracy of the Text-Mining approach was
compared to the standard process performed by
research personnel. The accuracy of the automatically
extracted data was about 88.0%.
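The Bag-of-Words representation described above can be sketched in a few lines. The two sample notes and the simple lowercase tokenisation are assumptions for illustration only; the point is that each document becomes a vector of word counts over a shared vocabulary, with word order discarded.

```python
import re
from collections import Counter


def bag_of_words(document):
    """Map a document to its word counts, ignoring order and grammar."""
    return Counter(re.findall(r"[a-z]+", document.lower()))


def to_vector(bag, vocabulary):
    """Turn a bag into a fixed-length count vector over a shared vocabulary."""
    return [bag[word] for word in vocabulary]


# Hypothetical clinical notes used only to demonstrate the representation.
notes = [
    "Patient reports chest pain and shortness of breath",
    "Chest pain resolved, no pain on examination",
]
bags = [bag_of_words(note) for note in notes]
vocabulary = sorted(set().union(*bags))
vectors = [to_vector(bag, vocabulary) for bag in bags]

for note, vector in zip(notes, vectors):
    print(vector, "<-", note)
```

Each position in a vector is one feature, so the resulting rows can be passed to a standard classifier that expects numeric input.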
(Chang et al., 2016) developed a workflow using
Text Mining to search, extract, and synthesise
information about Comparative Effectiveness
Research (CER) in healthcare. The study included the
development of an NLP-based pipeline to extract
information from unstructured CER data sources. The
Text-Mining solution could allow for the generation
of timely alerts, and the collection of systematic
reviews as well. Their approach was evaluated
using trial data from multiple sources including
Clinica, WHO International Clinical Trials
Registry Platform (ICTRP), and Citeline Trialtrove.
Other contributions focused on exploiting
Text Mining techniques for extracting concepts and
association rules from the scholarly literature. For
instance, (Kumari, and Mahalakshmi, 2019) applied
Text Mining to a subset of the biomedical literature
on PubMed. They aimed to discover information
related to the phytochemical properties of medicinal
plants. In another application, (Ji, Tian, Shen, and
Tran, 2016) developed a scalable approach to extract
associations among biomedical concepts in scientific
articles. Biomedical concepts were derived by
matching the text elements with the Unified Medical
Language System (UMLS) thesaurus. A
MapReduce-based algorithm was used to calculate the strength of
associations. The experimental dataset included a
large set of about 34K full-text articles. Their results
generally demonstrated that meaningful association
rules were highly ranked.
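The map and reduce steps behind such an association computation can be sketched on a single machine as follows. This is only an illustration of the pattern: the strength measure used by Ji et al. is not specified above, so raw co-occurrence counts are assumed here as a stand-in, and the concept annotations are hypothetical.

```python
from collections import Counter
from itertools import combinations


def map_article(concepts):
    """Map step: emit one (pair, 1) record per unordered concept pair in an article."""
    unique = sorted(set(concepts))
    return [(pair, 1) for pair in combinations(unique, 2)]


def reduce_counts(records):
    """Reduce step: sum the emitted counts for each concept pair."""
    totals = Counter()
    for pair, count in records:
        totals[pair] += count
    return totals


# Hypothetical per-article concept annotations (e.g. after thesaurus matching).
articles = [
    ["aspirin", "headache", "stroke"],
    ["aspirin", "stroke"],
    ["headache", "migraine"],
]
records = [record for article in articles for record in map_article(article)]
pair_counts = reduce_counts(records)

for pair, count in pair_counts.most_common(2):
    print(count, pair)
```

Because every article is mapped independently and the reduce step only sums counts per key, the same logic can be distributed over many workers, which is what makes the approach scale to large article collections.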
Recent studies considered more sophisticated
implementations based on the Bidirectional Encoder
Representations from Transformers (BERT), a
state-of-the-art NLP model (Devlin, Chang, Lee, and
Toutanova, 2019). The BERT approach brings the
advantage of allowing pre-trained models to tackle a
broad set of NLP tasks. In this regard, (Peterson,
Jiang, and Liu, 2020) developed a framework for
transforming free-text descriptions into a
standardised form based on the Health Level 7 (HL7)
standards. They utilised a combination of
domain-specific knowledgebases in tandem with the BERT
models. It was demonstrated that the BERT-based
language representation contributed significantly to
the model performance. Likewise, the literature
includes recent contributions that made use of the
BERT approach for a variety of Text Mining tasks
such as (Fan, Fan, and Smith, 2020), (Liao et al.,
2020), and (Vinod et al., 2020).
Furthermore, a major part of the recent
contributions has been positioned in the COVID-19
context. For instance, (Jelodar, Wang, Orji, and
Huang, 2020) used Text Mining to extract the
COVID-19 discussions from social media. They
applied topic modeling of public opinions to gain
insights into the various issues pertaining to the
COVID-19 pandemic. In addition, they implemented
an LSTM model for the sentiment classification of
comments. Meanwhile, (Bharti et al., 2020) developed a
multilingual conversational bot to provide primary
healthcare education, information, and advice to
chronic patients. Using NLP methods, the chatbot
was intended to act as a personal virtual doctor that
interacts with patients like a human being.
3.2 Text Analytics for Clinical Decision
(Tvardik et al., 2018) developed a Text-Analytics
solution for the automatic detection of medical events
using EHR data. The textual records included data
collected from three university hospitals based in
France over the period October 2009 to December
2010. The dataset spanned a variety of medical
surgical specialities including neurosurgery,
orthopaedic surgery, and digestive surgery. The
system performance was compared with standard
methods. The overall sensitivity and specificity were
about 84%. The study generally confirmed the
feasibility of using NLP-based methods to automate
the detection and monitoring of healthcare-associated
events in hospital facilities.
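For readers less familiar with the two metrics reported above, the following sketch shows how sensitivity and specificity are derived from reference labels and automatic detections. The label vectors are made-up values for illustration and do not reproduce the data of Tvardik et al.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels where 1 marks an event."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative reference label")
    # Sensitivity: share of true events detected; specificity: share of non-events cleared.
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical reference standard (1 = event present) and automatic detections.
reference = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
detected = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
sensitivity, specificity = sensitivity_specificity(reference, detected)
print(f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}")
```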
In another interesting application, (Brown, and
Marotta, 2017) developed a set of classification
models to predict the protocol and priority of MRI
brain examinations. They used the narrative clinical
information provided by clinicians. The models were
trained to make predictions on three tasks including:
i) Selection of examination protocols, ii) Evaluation
of the need for contrast administration, and iii)
Estimation of priority. The dataset consisted of about
14K MRI brain examinations over the period of
January 2013 to June 2015. The empirical results
largely demonstrated that the models could be
effectively employed to assist clinical decision
support in this regard.
In the context of radiology, several studies sought
to explore the application of NLP methods to extract
information from mammography reports. For
example, (Castro et al., 2017) developed a system to
automate the annotation and classification of the
Breast Imaging Reporting and Data System
(BI-RADS) categories. Specifically, the system tackled
two tasks including: i) Annotation of the BI-RADS
categories, and ii) Classification of the laterality for
each BI-RADS category. The study included about
2K radiology reports collected from 18 hospitals of
the University of Pittsburgh from 2003 to 2015.
Meanwhile, (Miao et al., 2018) applied Deep Learning to
extract the BI-RADS categories from breast
ultrasound reports in Chinese. The experiments
included a dataset of 540 manually annotated reports.
The model achieved an F1-score of 0.904.
(Afzal et al., 2018) applied NLP for the automatic
identification of critical limb ischemia (CLI). The
dataset included narrative clinical notes retrieved
from the EHR database. The model performance was
validated against the human abstraction of
clinical notes. Specifically, a physician reviewed and
interpreted the information in the EHR data for each
patient in the dataset. Overall, the method could
achieve an excellent F1-score of about 90%.
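Since the F1-score recurs throughout the reviewed studies, a short worked example may help. The counts below are invented so that the result lands at 0.9; they are not taken from Afzal et al.

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall, computed from raw counts."""
    if tp == 0:
        return 0.0  # no true positives means precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 45 correctly identified patients, 5 false alarms, 5 misses.
score = f1_score(tp=45, fp=5, fn=5)
print(f"F1 = {score:.3f}")
```

Because it combines precision and recall, the F1-score penalises a system that finds most cases only by raising many false alarms, which plain accuracy can hide on imbalanced clinical data.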
Using a Text-Analytics approach, (Carchiolo et
al., 2019) proposed a system for the automatic
classification of medical prescriptions (i.e. grantable
or not). Initially, the textual data were scanned from
medical prescription documents. They developed
an effective classifier based on
patient/doctor personal data, symptoms, pathology,
diagnosis, and suggested treatments. Their results
reported that only 5% of the prescriptions could not
be automatically classified.
Another recent study developed a framework to
realise scalable Text Analytics (Ge, Isah, Zulkernine,
and Khan, 2019). The framework aimed to support
real-time analytics for decision support in a variety of
domains such as healthcare. Deep
Learning was applied for NLP tasks including
language understanding and sentiment analysis. The
framework utilised a set of open-source tools
including Spark Streaming for real-time text
processing along with Zeppelin and Banana for data
visualisation. In addition, an LSTM model was
trained for the sentiment analysis. They practically
demonstrated the functionality of the framework
using a scenario with Twitter data.
(Kidwai, and Nadesh, 2020) discussed the
application of diagnostic chatbots in healthcare. They
developed a chatbot that makes use of NLP methods
to understand the user queries. After collecting the
initial symptoms, the chatbot would guide the user
through a sequence of questions towards making the
appropriate diagnosis. The system uses decision trees
and follows a top-down approach to conclude the
diagnosis. The chatbot was tested using a
medical database of about 150 diseases.
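The top-down traversal described above can be pictured with a toy decision tree. The tree content, the yes/no answer format, and the `answer_fn` callback are hypothetical simplifications; the chatbot of Kidwai and Nadesh additionally uses NLP to interpret free-text answers, which is omitted here.

```python
# Hypothetical toy tree: internal nodes ask yes/no questions, leaves hold a suggestion.
TREE = {
    "question": "Do you have a fever?",
    "yes": {
        "question": "Do you have a cough?",
        "yes": {"diagnosis": "possible respiratory infection"},
        "no": {"diagnosis": "fever of unclear origin"},
    },
    "no": {"diagnosis": "no match in this toy tree"},
}


def diagnose(tree, answer_fn):
    """Walk the tree from the root, asking one question per node until a leaf."""
    node = tree
    asked = []
    while "diagnosis" not in node:
        question = node["question"]
        asked.append(question)
        node = node["yes" if answer_fn(question) else "no"]
    return node["diagnosis"], asked


# Scripted answers stand in for a real user in this example.
scripted = {"Do you have a fever?": True, "Do you have a cough?": False}
result, asked = diagnose(TREE, lambda question: scripted[question])
print(result, "after", len(asked), "questions")
```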
While plentiful studies sought to develop
predictive models to help streamline hospital
admissions, an increasing number of contributions attempted to
utilise unstructured data such as free-text notes made
by nurses or physicians at the Emergency Department
(ED). For instance, (Sterling, Patzer, Di, and
Schrager, 2019) utilised the bag-of-words
representation of triage free-text notes. Using a
dataset of over 250K ED visits, neural network
models were trained to predict hospital admissions.
They achieved a promising performance with
ROC-AUC≈0.74. Further, (Chen et al., 2020) aimed to
compare the performance of ML models with the
inclusion of textual elements. They applied Deep
Learning along with Word Embeddings using clinical
narratives. They practically demonstrated that the
model accuracy generally improved with the addition
of free-text fields.
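The ROC-AUC values quoted for admission prediction can be read as the probability that a randomly chosen admitted patient receives a higher predicted risk than a randomly chosen discharged one. The sketch below computes it by direct pairwise comparison; the labels and risk scores are invented for illustration.

```python
def roc_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical outcomes (1 = admitted) and model risk scores for six ED visits.
admitted = [1, 0, 1, 0, 0, 1]
risk = [0.9, 0.4, 0.6, 0.7, 0.2, 0.8]
print(f"ROC-AUC = {roc_auc(admitted, risk):.3f}")
```

The pairwise loop is quadratic and only suitable for small examples; on datasets of the size reported above, a rank-based computation would be used instead.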
Similarly, (Arnaud, Elbattah, Gignon, and
Dequen, 2020) presented an approach based on
integrating structured data with unstructured textual
notes recorded at the triage stage. The key idea was
to apply a multi-input of mixed data for training a
classification model to predict hospitalisation. On one
hand, a standard Multi-Layer Perceptron (MLP)
model was used with the standard set of features (i.e.
numeric and categorical). On the other hand, a
Convolutional Neural Network (CNN) was used to
operate over the textual data. Their empirical results
demonstrated that the classifier could achieve a very
good performance with ROC-AUC≈0.83.
The use of ontologies has also drawn attention in
a variety of medical and healthcare applications. To
name a few, (Chakrabarty, and Roy, 2016) used
ontology alignment for the personalisation of cancer
treatment. A patient ontology was mapped to the
disease ontology to dynamically transform general
treatment options into individual intervention plans,
personalised for the patient. In another application,
(Comelli, Agnello, and Vitabile, 2015) proposed an
ontology-based indexing and retrieval system for
mammography reports. Using an improved
radiological ontology, medical terms were organised
in a hierarchy, which could be used to measure the semantic
similarity between unstructured reports. The system
was tested using a dataset of 126 mammographic
reports in the Italian language, provided by the
University Hospital of Palermo Policlinico.
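To illustrate how a term hierarchy can yield a similarity measure, the sketch below scores two terms by the length of the path between them through their nearest common ancestor. Both the miniature hierarchy and the path-based formula are assumptions chosen for simplicity; they are not the ontology or the measure of Comelli et al., and extending term-level scores to whole reports is left out.

```python
# Hypothetical miniature hierarchy: each term maps to its parent (None marks the root).
PARENT = {
    "finding": None,
    "mass": "finding",
    "calcification": "finding",
    "spiculated mass": "mass",
    "oval mass": "mass",
    "microcalcification": "calcification",
}


def ancestors(term):
    """Return the chain from a term up to the root, starting with the term itself."""
    chain = []
    while term is not None:
        chain.append(term)
        term = PARENT[term]
    return chain


def path_similarity(a, b):
    """Return 1 / (1 + edges between a and b via their nearest common ancestor)."""
    chain_b = ancestors(b)
    steps_in_b = {term: steps for steps, term in enumerate(chain_b)}
    for steps_a, term in enumerate(ancestors(a)):
        if term in steps_in_b:
            return 1.0 / (1 + steps_a + steps_in_b[term])
    return 0.0  # no shared ancestor (cannot happen in a single-rooted hierarchy)


print(path_similarity("spiculated mass", "oval mass"))
print(path_similarity("spiculated mass", "microcalcification"))
```

Sibling terms under the same parent score higher than terms that only meet at the root, which is the behaviour a retrieval system would need to rank reports mentioning related findings.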
Furthermore, part of the recent efforts explored
the applicability of Text Analytics to predict the
International Classification of Diseases (ICD) codes.
The manual encoding process is usually
time-consuming, and prone to various errors as well. In this
regard, (Teng et al., 2020) applied medical topic
mining and Deep Learning to automatically predict
the ICD codes from free-text medical records. The
study used the MIMIC-III dataset, which provides a
large freely accessible repository of ICU records
(Johnson et al., 2016). The reported results indicated
that their method could increase the F1-score
by approximately 5% compared to earlier work.
Similarly, (Gangavarapu et al., 2020) developed an
approach to help predict the ICD-9 code groups based
on unstructured nursing notes. They applied vector
space and topic modeling to structure the raw clinical
data, which allowed for capturing the semantic
information in the free-text notes.
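As one possible reading of vector space modeling for free-text notes, the sketch below weights terms by TF-IDF and compares notes by cosine similarity. The weighting scheme, the tokeniser, and the three sample notes are assumptions for illustration; the exact representation used by Gangavarapu et al. may differ.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a note and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    bags = [Counter(tokenize(doc)) for doc in documents]
    n_docs = len(bags)
    doc_freq = Counter(term for bag in bags for term in bag)
    return [
        {term: count * math.log(n_docs / doc_freq[term]) for term, count in bag.items()}
        for bag in bags
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical nursing notes used only to demonstrate the representation.
notes = [
    "patient febrile with productive cough",
    "productive cough and fever overnight",
    "wound dressing changed, no bleeding",
]
vectors = tfidf_vectors(notes)
print(round(cosine(vectors[0], vectors[1]), 3), cosine(vectors[0], vectors[2]))
```

Notes that share informative terms end up close together in this space, which gives downstream classifiers a structured input in place of raw text.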
Over the past five years, there have been pronounced
innovations in NLP research including novel
approaches and technologies, which in turn have
resonated in the healthcare domain. Most remarkably,
Deep Learning has been increasingly applied for
developing large-scale language models. Deep
architectures of CNNs have introduced a potent
mechanism for learning feature representations from
raw data automatically (LeCun et al., 1989; LeCun,
Bottou, Bengio, and Haffner, 1998). Equally
important, recent applications have started to adopt
the BERT-based approach, which avails of Transfer
Learning for NLP tasks. Furthermore, scalable
analytics platforms have been utilised for real-time
data processing. Examples include Apache Spark
and IBM Watson.
In terms of data sources, it appears that Text
Analytics was applied to a broad variety of
healthcare data. The datasets ranged from standard
EHR datasets, medical reports, free-text notes, and
scientific literature, to user-generated content on
online forums or social media. In this regard, Text
Analytics was implemented for a considerable range of
problems including extracting evidence-based care
interventions and patient outcomes, or identifying
the population at risk. To this end, NLP
pipelines have been intensively developed for a
variety of text-processing tasks such as: i) Named
entity recognition, ii) Topic modeling, iii) Semantic
labelling, iv) Relationship extraction, v) Question
answering, vi) Text summarisation, vii) Sentiment
analysis, and others.
Nevertheless, a set of hurdles stands in the way
of a widespread implementation of Text Analytics in
the healthcare domain. A key challenge is the
availability of quality data, which is a fundamental
factor for building robust NLP models, and for ML in
general. Beyond that, the underlying data biases pose
multiple ethical concerns for the deployment of NLP
models. Such ethical issues have been recently
discussed in the literature (e.g. Davenport, and
Kalakota, 2019; Baclic et al., 2020). Other
technical challenges may relate to the integration of
Text Analytics tools with existing healthcare systems.
The conventional IT systems may not be well-poised
to be integrated with sophisticated Text Analytics,
which requires an advanced infrastructure and a
highly technical skillset as well. Furthermore, the
implementation of Text Analytics typically requires
intensive development cycles.
In summary, it is anticipated that the future holds
many interesting opportunities for implementing Text
Analytics in a multitude of healthcare applications.
The need for leveraging unstructured textual data
should bring up new practical areas for taking
advantage of the potential of Text Analytics.
There is an obvious need to leverage unstructured
textual data to support the operations of healthcare in
many aspects. A large proportion of clinical data
is unavoidably stockpiled into unstructured, or
semi-structured, documents or notes. Text Analytics should
therefore play a key role in transforming textual data
into actionable insights.
This study endeavoured to review the state-of-the-art
applications of Text Analytics in healthcare. In
this regard, the applications could be broadly
summarised as follows:
Information extraction from free-text data stored in EHR databases, clinical reports, nursing notes, scientific literature, and user-generated content.
Applying vector-based representations to a variety of clinical documents, which transforms the textual data into a form amenable to ML.
Sequence-based modeling to address tasks, such as sentiment analysis, using notes in clinical reports, or comments posted on online forums.
Predictive analytics applications to support clinical decision making.
Implementations of Conversational AI technologies to use chatbots to interact with patients in a human-like way.
Afzal, N., Mallipeddi, V. P., Sohn, S., Liu, H., Chaudhry,
R., Scott, C. G., ... & Arruda-Olson, A. M. (2018).
Natural language processing of clinical notes for
identification of critical limb ischemia. International
Journal of Medical Informatics, 111, 83-89.
Arnaud, E., Elbattah, M., Gignon, M., & Dequen, G. (2020).
Deep learning to predict hospitalization at triage:
Integration of structured data and unstructured text. In
Proceedings of the 2020 IEEE International Conference
on Big Data (Big Data).
Baclic, O., Tunis, M., Young, K., Doan, C., Swerdfeger, H.,
& Schonfeld, J. (2020). Challenges and opportunities
for public health made possible by advances in natural
language processing. Canada Communicable Disease
Report, 46(6), 161-168.
Bharti, U., Bajaj, D., Batra, H., Lalit, S., Lalit, S., &
Gangwani, A. (2020). Medbot: Conversational artificial
intelligence powered chatbot for delivering tele-health
after COVID-19. In Proceedings of the 2020 5th
International Conference on Communication and
Electronics Systems (ICCES), pp. 870-875. IEEE.
Bird, S., Klein, E., & Loper, E. (2009). Natural language
processing with Python: analyzing text with the natural
language toolkit. O'Reilly Media, Inc.
Booth, A., Sutton, A., & Papaioannou, D. (2011).
Systematic approaches to a successful literature review.
Brown, A. D., & Marotta, T. R. (2017). A natural language
processing-based model to automate MRI brain
protocol selection and prioritization. Academic
Radiology, 24(2), 160-166.
Carchiolo, V., Longheu, A., Reitano, G., & Zagarella, L.
(2019). Medical prescription classification: A NLP-
based approach. In Proceedings of the 2019 Federated
Conference on Computer Science and Information
Systems (FedCSIS), pp. 605-609. IEEE.
Castro, S. M., Tseytlin, E., Medvedeva, O., Mitchell, K.,
Visweswaran, S., Bekhuis, T., & Jacobson, R. S. (2017).
Automated annotation and classification of BI-RADS
assessment from radiology reports. Journal of
Biomedical Informatics, 69, 177-187.
Chakrabarty, A., & Roy, S. (2016). Personalizing
healthcare services to support decision making in
treatment of cancer patients using ontology alignment.
In Proceedings of the India International Conference
on Information Processing (IICIP), pp. 1-6. IEEE.
Chang, M., Chang, M., Reed, J. Z., Milward, D., Xu, J. J.,
& Cornell, W. D. (2016). Developing timely insights
into comparative effectiveness research with a text-
mining pipeline. Drug Discovery Today, 21(3), 473-
Chen, C. H., Hsieh, J. G., Cheng, S. L., Lin, Y. L., Lin, P.
H., & Jeng, J. H. (2020). Emergency department
disposition prediction using a deep neural network with
integrated clinical narratives and structured data.
International Journal of Medical Informatics, 104146.
Comelli, A., Agnello, L., & Vitabile, S. (2015). An
ontology-based retrieval system for mammographic
reports. In Proceedings of the 2015 IEEE Symposium
on Computers and Communication (ISCC), pp. 1001-1006.
IEEE.
Davenport, T., & Kalakota, R. (2019). The potential for
artificial intelligence in healthcare. Future Healthcare
Journal, 6(2), 94.
Demner-Fushman, D., Chapman, W. W., & McDonald, C.
J. (2009). What can natural language processing do for
clinical decision support? Journal of Biomedical
Informatics, 42(5), 760-772.
Deng, S., Zhou, Y., Zhang, P., & Abbasi, A. (2019). Using
discussion logic in analyzing online group discussions:
A text mining approach. Information & Management,
56(4), 536-551.
Devlin, J., Chang, M., Lee, K., & Toutanova, K. (2019).
BERT: Pre-training of Deep Bidirectional
Transformers for Language Understanding. In
Proceedings of the Annual Conference of the North
American Chapter of the Association for
Computational Linguistics (NAACL-HLT).
Fan, B., Fan, W., & Smith, C. (2020). Adverse drug event
detection and extraction from open data: A deep
learning approach. Information Processing &
Management, 57(1), 102131.
Gangavarapu, T., Jayasimha, A., Krishnan, G. S., &
Kamath, S. (2020). Predicting ICD-9 code groups with
fuzzy similarity based supervised multi-label
classification of unstructured clinical nursing notes.
Knowledge-Based Systems, 190, 105321.
Ge, S., Isah, H., Zulkernine, F., & Khan, S. (2019). A
scalable framework for multilevel streaming data
analytics using deep learning. In Proceedings of the
IEEE 43rd Annual Computer Software and
Applications Conference (COMPSAC), Vol. 2, pp. 189-
194. IEEE.
Goldberg, Y. (2017). Neural network methods for natural
language processing. In Hirst, G. (Ed.). Synthesis
Lectures on Human Language Technologies, 10(1), p.
69. Morgan & Claypool Publishers.
LeCun, Y., Boser, B. E., Denker, J. S., Henderson, D.,
Howard, R. E., Hubbard, W. E., & Jackel, L. D.
(1989). Handwritten digit recognition with a back-
propagation network. In Proceedings of Advances in
Neural Information Processing Systems (NIPS) (pp.
LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998).
Gradient-based learning applied to document
recognition. In Proceedings of the IEEE, 86(11), 2278-
LeCun, Y. (2014). Chapter 3: Facebook. In S. Gutierrez
(Ed.), Data Scientists at Work. Apress.
Liao, Z., Liu, L., Wu, Q., Teney, D., Shen, C., van den
Hengel, A., & Verjans, J. (2020). Medical Data Inquiry
Using a Question Answering Model. In Proceedings of
the 17th IEEE International Symposium on Biomedical
Imaging (ISBI) (pp. 1490-1493). IEEE.
Lieder, I., Segal, M., Avidan, E., Cohen, A., & Hope, T.
(2019). Learning a faceted customer segmentation for
discovering new business opportunities at Intel. In
Proceedings of the IEEE International Conference on
Big Data, pp. 6136-6138. IEEE.
Han, J., Nandan, N., & Sun, A. (2015). Did You Know? A
Rule-Based Approach to Finding Similar Questions on
Online Health Forums. In Proceedings of the 2015
International Conference on Healthcare Informatics,
pp. 513-514. IEEE.
Hearst, M. A. (1999). Untangling text data mining. In
Proceedings of the 37th Annual meeting of the
Association for Computational Linguistics (pp. 3-10).
James, T. L., Calderon, E. D. V., & Cook, D. F. (2017).
Exploring patient perceptions of healthcare service
quality through analysis of unstructured feedback.
Expert Systems with Applications, 71, 479-492.
Jelodar, H., Wang, Y., Orji, R., & Huang, H. (2020). Deep
sentiment classification and topic discovery on novel
coronavirus or COVID-19 online discussions: NLP using
LSTM recurrent neural network approach. IEEE Journal
of Biomedical and Health Informatics, 24(10),
2733-2742.
Jensen, P. B., Jensen, L. J., & Brunak, S. (2012). Mining
electronic health records: towards better research
applications and clinical care. Nature Reviews Genetics,
13(6), 395-405.
Ji, Y., Tian, Y., Shen, F., & Tran, J. (2016). Leveraging
MapReduce to efficiently extract associations between
biomedical concepts from large text data.
Microprocessors and Microsystems, 46, 202-210.
Johnson, A. E., Pollard, T. J., Shen, L., Lehman, L. H., Feng,
M., Ghassemi, M., ... & Mark, R. G. (2016). MIMIC-
III, a freely accessible critical care database. Scientific
Data, 3, 160035.
Kidwai, B., & Nadesh, R. K. (2020). Design and
development of diagnostic Chabot for supporting
primary health care systems. Procedia Computer
Science, 167, 75-84.
Kumari, B. N., & Mahalakshmi, G. S. (2019). A cloud
based knowledge discovery framework, for medicinal
plants from PubMed literature. Informatics in Medicine
Unlocked, 16, 100226.
Martínez, P., Martínez, J. L., Segura-Bedmar, I., Moreno-
Schneider, J., Luna, A., & Revert, R. (2016). Turning
user generated health-related content into actionable
knowledge through text analytics services. Computers
in Industry, 78, 43-56.
Miao, S., Xu, T., Wu, Y., Xie, H., Wang, J., Jing, S., ... &
Shan, T. (2018). Extraction of BI-RADS findings from
breast ultrasound reports in Chinese using deep learning
approaches. International Journal of Medical
Informatics, 119, 17-21.
Paglialonga, A., Riboldi, M., Tognola, G., & Caiani, E. G.
(2017). Automated identification of health apps'
medical specialties and promoters from the store
webpages. In Proceedings of the E-Health and
Bioengineering Conference (EHB), pp. 197-200. IEEE.
Paglialonga, A., Pinciroli, F., Tognola, G., Barbieri, R.,
Caiani, E. G., & Riboldi, M. (2017). e-Health solutions
for better care: Characterization of health apps to
extract meaningful information and support users'
choices. In Proceedings of the 3rd International Forum
on Research and Technologies for Society and Industry
(RTSI) (pp. 1-6). IEEE.
Pendyala, V. S., & Figueira, S. (2017). Automated medical
diagnosis from clinical data. In Proceedings of the
IEEE Third International Conference on Big Data
Computing Service and Applications (BigDataService),
pp. 185-190. IEEE.
Peterson, K. J., Jiang, G., & Liu, H. (2020). A corpus-driven
standardization framework for encoding clinical
problems with HL7 FHIR. Journal of Biomedical
Informatics, 110, 103541.
Rose, S., Engel, D., Cramer, N., & Cowley, W. (2010).
Automatic keyword extraction from individual
documents. Text Mining: Applications and Theory, 1,
Sarkar, D. (2019). Text analytics with Python: a
practitioner's guide to natural language processing. Apress.
Spasić, I., Uzuner, Ö., & Zhou, L. (2020). Emerging clinical
applications of text analytics. International Journal of
Medical Informatics, 134.
Sterling, N. W., Patzer, R. E., Di, M., & Schrager, J. D.
(2019). Prediction of emergency department patient
disposition based on natural language processing of
triage notes. International Journal of Medical
Informatics, 129, 184-188.
Sutar, S. G. (2017). Intelligent data mining technique of
social media for improving health care. In Proceedings
of the 2017 International Conference on Intelligent
Computing and Control Systems (ICICCS), pp. 1356-
1360. IEEE.
Tan, A. H. (1999). Text mining: The state of the art and the
challenges. In Proceedings of the 1999 PAKDD
Workshop on Knowledge Discovery from Advanced
Databases, Vol. 8, pp. 65-70.
Tvardik, N., Kergourlay, I., Bittar, A., Segond, F., Darmoni,
S., & Metzger, M. H. (2018). Accuracy of using natural
language processing methods for identifying
healthcare-associated infections. International Journal
of Medical Informatics, 117, 96-102.
Teng, F., Ma, Z., Chen, J., Xiao, M., & Huang, L. (2020).
Automatic medical code assignment via deep learning
approach for intelligent healthcare. IEEE Journal of
Biomedical and Health Informatics, vol. 24, no. 9, pp.
van Dijk, W. B., Fiolet, A. T., Schuit, E., Sammani, A.,
Groenhof, T. K. J., van der Graaf, R., ... & Grobbee, D.
E. (2020). Text-mining in electronic healthcare records
can be used as efficient tool for screening and data-
collection in cardiovascular trials: a multicenter
validation study. Journal of Clinical Epidemiology.
Vinod, P., Safar, S., Mathew, D., Venugopal, P., Joly, L. M.,
& George, J. (2020). Fine-tuning the BERTSUMEXT
model for Clinical Report Summarization. In
Proceedings of the 2020 International Conference for
Emerging Technology (INCET) (pp. 1-7). IEEE.
Full-text available
Abstract Objective: This study aimed to validate trial patient eligibility screening and baseline data-collection using text-mining in electronic healthcare records (EHRs), comparing the results to those of an international trial. Study design and setting: In three medical centers with different EHR vendors, EHR-based text-mining was used to automatically screen patients for trial eligibility and extract baseline data on nineteen characteristics. First. the yield of screening with automated EHR text-mining search was compared with manual screening by research personnel. Second, accuracy of extracted baseline data by EHR text mining was compared to manual data entry by research personnel RESULTS: 568 (0.6%) of 92,466 patients visiting the out-patient cardiology departments were enrolled in the trial during its recruitment period using manual screening methods. Automated EHR data screening of all patients showed that the number of patients needed to screen could be reduced by 73,863 (79.9%). The remaining 18,603 (20.1%) contained 458 of the actual participants (82.4% of participants). In trial participants, automated EHR text-mining missed a median of 2.8% (Interquartile range [IQR] across all variables 0.4-8.5%) of all data points compared to manually collected data. Overall accuracy of automatically extracted data was 88.0% (IQR 84.7-92.8%). Conclusion: Automatically extracting data from EHRs using text-mining can be used to identify trial participants and to collect baseline information.
Full-text available
Internet forums and public social media, such as online healthcare forums, provide a convenient channel for users (people/patients) concerned about health issues to discuss and share information with each other. In late December 2019, an outbreak of a novel coronavirus (infection from which results in the disease named COVID-19) was reported, and, due to the rapid spread of the virus in other parts of the world, the World Health Organization declared a state of emergency. In this paper, we used automated extraction of COVID-19—related discussions from social media and a natural language process (NLP) method based on topic modeling to uncover various issues related to COVID-19 from public opinions. Moreover, we also investigate how to use LSTM recurrent neural network for sentiment classification of COVID-19 comments. Our findings shed light on the importance of using public opinions and suitable computational techniques to understand issues surrounding COVID-19 and to guide related decision-making. In addition, experiments demonstrated that the research model achieved an accuracy of 81.15% — a higher accuracy than that of several other well-known machine-learning algorithms for COVID-19—Sentiment Classification.
Full-text available
Natural language processing (NLP) is a subfield of artificial intelligence devoted to understanding and generation of language. The recent advances in NLP technologies are enabling rapid analysis of vast amounts of text, thereby creating opportunities for health research and evidence-informed decision making. The analysis and data extraction from scientific literature, technical reports, health records, social media, surveys, registries and other documents can support core public health functions including the enhancement of existing surveillance systems (e.g. through faster identification of diseases and risk factors/at-risk populations), disease prevention strategies (e.g. through more efficient evaluation of the safety and effectiveness of interventions) and health promotion efforts (e.g. by providing the ability to obtain expert-level answers to any health related question). NLP is emerging as an important tool that can assist public health authorities in decreasing the burden of health inequality/inequity in the population. The purpose of this paper is to provide some notable examples of both the potential applications and challenges of NLP use in public health.
Full-text available
Technology is increasingly becoming a massive part of today’s healthcare scenario. Technology has changed the way how patients communicate with doctors and not only that, but also how healthcare is administered. Artificial intelligence and Chabots are two groundbreaking technologies that have changed how patients and doctors perceive healthcare. To make healthcare system more interactive a diagnostic Chabot is designed and developed using latest algorithms in machine learning, decision tree algorithm to help the user to form a diagnosis of their condition based on their symptoms. The system will be fed with information pertaining to various diseases and using NLP, it will be able to understand the user query and give a suitable response. The system can be used for effective information retrieval in a similar manner like siri, alexa etc but the scope will be limited to disease diagnosis.
Conference Paper
Overcrowding in Emergency Departments (ED) is considered as an international issue, which could have adverse impacts on multiple care outcomes such as the length of stay for example. Part of the solution could lie in the early prediction of the patient outcome as discharge or hospitalization. This study applies Deep Learning to this end. A large-scale dataset of about 260K ED records was provided by the Amiens-Picardy University Hospital in France. In general, our approach is based on integrating structured data with unstructured textual notes recorded at the triage stage. The key idea is to apply a multi-input of mixed data for training a classification model to predict hospitalization. In a simultaneous manner, the model training utilizes the numeric features along with textual data. On one hand, a standard Multi-Layer Perceptron (MLP) model is used with the standard set of features (i.e. numeric and categorical). On the other hand, a Convolutional Neural Network (CNN) is used to operate over the textual data. The two components of learning are conducted independently in parallel. The empirical results demonstrated that the classifier could achieve a very good accuracy with ROC-AUC≈0.83. The study is conceived to contribute to the mounting efforts of applying Natural Language Processing in the healthcare domain.
Free-text problem descriptions are brief explanations of patient diagnoses and issues, commonly found in problem lists and other prominent areas of the medical record. These compact representations often express complex and nuanced medical conditions, making their semantics challenging to fully capture and standardize. In this study, we describe a framework for transforming free-text problem descriptions into standardized Health Level 7 (HL7) Fast Healthcare Interoperability Resources (FHIR) models. This approach leverages a combination of domain-specific dependency parsers, Bidirectional Encoder Representations from Transformers (BERT) natural language models, and cui2vec Unified Medical Language System (UMLS) concept vectors to align extracted concepts from free-text problem descriptions into structured FHIR models. A neural network classification model is used to classify thirteen relationship types between concepts, facilitating mapping to the FHIR Condition resource. We use data programming, a weak supervision approach, to eliminate the need for a manually annotated training corpus. Shapley values, a mechanism to quantify contribution, are used to interpret the impact of model features. We found that our methods identified the focus concept, or primary clinical concern of the problem description, with an F1 score of 0.95. Relationships from the focus to other modifying concepts were extracted with an F1 score of 0.90. When classifying relationships, our model achieved a 0.89 weighted average F1 score, enabling accurate mapping of attributes into HL7 FHIR models. We also found that the BERT input representation predominantly contributed to the classifier decision as shown by the Shapley values analysis.
With the development of healthcare 4.0, there has been an explosion in the amount of data such as image, medical text, physiological signals, lab tests, etc. Among them, medical records provide a complete picture of the associated clinical events. However, the processing of medical texts is difficult because they are structurally free, diverse in style, and have subjective factors. Assigning metadata codes from the International Classification of Diseases (ICD) presents a standardized way of indicating diagnoses and procedures, so it becomes a mandatory process for understanding medical records to make better clinical and financial decisions. Such a manual encoding task is time-consuming, error-prone and expensive. In this paper, we proposed a deep learning approach and a medical topic mining method to automatically predict ICD codes from text-free medical records. The result of the F1 score on MIMIC increases by 5% over the state of art. It also suitable for multiple ICD versions and languages. For the specific disease, atrial fibrillation, the F1 score is up to 96% and 93.3% using in-house ICD-10 datasets and MIMIC datasets, respectively. We developed an AI-based coding system, which can greatly improve the efficiency and accuracy of human coders, and meanwhile accelerate the secondary use for clinical informatics.
Background: Emergency department (ED) overcrowding has been a serious issue and demands effective clinical decision-making on patient disposition. Previous studies have shown that emergency clinical narratives provide rich context for clinical decisions. We aimed to develop a disposition prediction model using a deep learning modeling strategy with heterogeneous data including the physicians' narratives. Methods: We constructed a retrospective cohort of all 104,083 ED visits of non-trauma adults during 2017-18 from an academically affiliated ED in Taiwan. 18,308 visits were excluded based on the completeness of each record and unpredictable dispositions, such as out-of-hospital cardiac arrest, against-advice discharge, and escapes. We integrated the subjective section of the first physicians' clinical narratives and structured data (e.g., demographics and triage vital signs) as the predictors available at the first physician-patient encounter. To predict final patient disposition (i.e., hospitalization or discharge), a deep neural network (DNN) model was developed with word embedding, a common natural language processing method. We compared the proposed model to a reference model using the Rapid Emergency Medicine Score, a logistic regression model with structured data, and a DNN model with paragraph vectors. The F1 score was used to measure the predictive performance of each model. Results: The F1 scores (with 95% CI) for the proposed model, the reference model, the logistic regression model with structured data, and the DNN model with paragraph vectors were 0.674 (0.669-0.679), 0.474 (0.469-0.479), 0.547 (0.543-0.551), and 0.602 (0.596-0.607), respectively. When analyzing the relationship between context length and predictive performance under the proposed model, the F1 score at the 95th percentile of the word counts was higher than that at the 25th percentile of the word counts in chief complaint [0.634 (0.629-0.640) vs. 0.624 (0.620-0.628)] and in present illness [0.671 (0.667-0.674) vs. 0.654 (0.651-0.658)], but not in past medical history [0.674 (0.669-0.679) vs. 0.673 (0.666-0.679)]. Conclusions: The proposed deep learning model, which applies natural language processing to the first physicians' clinical narratives together with structured data, outperformed the commonly used models in terms of F1 score. It also demonstrated the importance of the subjective section of clinical narratives, which serves as a vital predictor for ED clinical decision-making.
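To make the reported metric concrete, the following minimal sketch computes an F1 score for a binary disposition prediction, together with a 95% confidence interval. The abstract does not say how its intervals were obtained; the percentile bootstrap used here is an assumption chosen for illustration. The encoding (1 = hospitalization, 0 = discharge), the toy labels, and the function names are likewise hypothetical.

```python
import random


def f1_score(y_true, y_pred, positive=1):
    """Binary F1: harmonic mean of precision and recall for `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_ci(y_true, y_pred, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for F1 (an assumed method, see lead-in)."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_resamples):
        # Resample visits with replacement, keeping label/prediction pairs aligned.
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(f1_score([y_true[i] for i in idx],
                               [y_pred[i] for i in idx]))
    scores.sort()
    lower = scores[int((alpha / 2) * n_resamples)]
    upper = scores[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Toy data: 1 = hospitalization, 0 = discharge (hypothetical encoding).
truth = [1, 1, 1, 0, 0, 0, 1, 0]
preds = [1, 0, 1, 0, 1, 0, 1, 0]
print(f1_score(truth, preds))      # 3 TP, 1 FP, 1 FN
print(bootstrap_ci(truth, preds))
```

With only eight toy visits the interval is very wide; the narrow intervals in the abstract reflect a cohort of tens of thousands of visits.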