
Managing Unstructured Big Data in Healthcare System

Authors:
  • Seoul National University Hospital
Digital transformation and the Fourth Industrial Revolution,
both recent items on the information technology (IT) agenda,
have led many countries to expect big data to become a source
of new economic value that will determine the success or
failure of their governments in the future. Following this
trend, the big data industry in the healthcare field has grown
rapidly in recent years, and several global IT companies in
the United States and Europe are reporting big data use cases
in the medical field.
Medical big data refers to large-scale data that are difficult
to handle with existing database management systems and that
originate in a digitalized healthcare environment including
medical centers, wearable devices, and social media. Medical
data, which are growing exponentially, include large volumes
of both structured and unstructured data, as in other domains
[1]. A major problem in healthcare is that about 80% of
medical data (e.g., text, images, and signals) remain
unstructured and untapped after they are created [2]. Because
this type of data is hard to handle in electronic medical
records and most hospital information systems, it has long
tended to be ignored, left unsaved, or abandoned in most
medical centers [3]. Although many hospitals continue to
generate large amounts of data, these data are hard to connect
with medical big data research and the healthcare artificial
intelligence industry. Therefore, we need to manage these
unmanaged unstructured big data in healthcare systems before
discussing the development of medical artificial intelligence,
which is currently based on machine learning technology.
In many hospitals, time series data are the least managed of
the many types of unstructured medical data because of their
huge file sizes, despite the great value of their applications.
Typical unstructured big data in hospitals are as follows.
The first type is medical video data, which have recently been
generated in rapidly growing volumes by new types of medical
imaging devices (e.g., endoscopes, laparoscopes, surgical
robots, capsule endoscopes, emergency video cameras, and
thoracoscopes). The second is biosignal data, which are
displayed on the screens of patient monitors in operating
rooms or intensive care units and on wearable health
monitoring devices. The third is audio data, which are
produced verbally or nonverbally by patients as a result of
pathophysiological processes and by medical staff for
efficient communication during clinical procedures.
To enhance the use of these unstructured medical big data, we
need to establish data collection, anonymization, and quality
assurance processes. Metadata for each type of unstructured
medical data need to be defined, standardized, extracted, and
visualized automatically. An open platform for the integration
and utilization of unstructured clinical data should then be
developed to reflect these concepts.
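As a rough illustration of what a defined metadata record with a pseudonymization step and a basic quality check might look like, the following Python sketch may help. It is a minimal example under stated assumptions, not a published standard or the method of any particular hospital: the field names, the `MODALITIES` labels, and the helper functions `pseudonymize` and `quality_issues` are all hypothetical and were chosen only to mirror the three data types and the processes named above.

```python
import hashlib
import hmac
from dataclasses import dataclass, asdict

# Hypothetical modality labels mirroring the three data types in the text.
MODALITIES = {"video", "biosignal", "audio"}


def pseudonymize(patient_id: str, secret_key: bytes) -> str:
    """Replace a patient identifier with a keyed hash (HMAC-SHA256).

    This is pseudonymization only; full anonymization would also have to
    handle identifying content inside the recording itself.
    """
    return hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256).hexdigest()


@dataclass
class RecordingMetadata:
    """Illustrative metadata fields (assumed, not a standard)."""
    subject_key: str       # pseudonymized patient identifier
    modality: str          # expected to be one of MODALITIES
    device: str            # e.g. "laparoscope" or "patient monitor"
    duration_s: float      # recording length in seconds
    sample_rate_hz: float  # frames or samples per second


def quality_issues(meta: RecordingMetadata) -> list[str]:
    """Return human-readable problems; an empty list means the record passes."""
    issues = []
    if meta.modality not in MODALITIES:
        issues.append(f"unknown modality: {meta.modality}")
    if meta.duration_s <= 0:
        issues.append("duration must be positive")
    if meta.sample_rate_hz <= 0:
        issues.append("sample rate must be positive")
    return issues


if __name__ == "__main__":
    meta = RecordingMetadata(
        subject_key=pseudonymize("patient-0001", b"demo-key-not-for-production"),
        modality="biosignal",
        device="patient monitor",
        duration_s=3600.0,
        sample_rate_hz=500.0,
    )
    print(asdict(meta))          # the standardized record as a plain dict
    print(quality_issues(meta))  # problems found by the basic checks
```

A keyed hash is used here rather than a plain hash so that identifiers cannot be recovered by hashing a list of candidate patient IDs without the key; a real deployment would need proper key management and a broader de-identification policy.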
Even if machine learning technologies with high accuracy were
developed, they would be useless without quality-controlled,
standardized, and structured data derived from unstructured
medical big data. In addition, field-oriented education
programs for nurturing multidisciplinary specialists who are
able to interpret, analyze, and utilize unstructured medical
big data should be discussed together with the related
healthcare industry.
Hyoun-Joong Kong
Editorial Taskforce Member of Healthcare Informatics Research, Chungnam National University, Daejeon, Korea
Healthc Inform Res. 2019 January;25(1):1-2.
https://doi.org/10.4258/hir.2019.25.1.1
pISSN 2093-3681 • eISSN 2093-369X
Editorial
This is an Open Access article distributed under the terms of the Creative
Commons Attribution Non-Commercial License
(http://creativecommons.org/licenses/by-nc/4.0/) which permits unrestricted
non-commercial use, distribution, and reproduction in any medium, provided
the original work is properly cited.
2019 The Korean Society of Medical Informatics
References
1. Weber GM, Mandl KD, Kohane IS. Finding the missing link for big
biomedical data. JAMA 2014;311(24):2479-80.
2. HIT Consultant. Why unstructured data holds the key to intelligent
healthcare systems [Internet]. Atlanta (GA): HIT Consultant; 2015
[cited at 2019 Jan 15]. Available from:
https://hitconsultant.net/2015/03/31/tapping-unstructured-data-healthcares-biggest-hurdle-realized/#.XFvZ1lwvOUk.
3. Pak HS. Unstructured data in healthcare [Internet]. Fremont (CA):
Healthcare Tech Outlook; c2018 [cited at 2019 Jan 15]. Available from:
https://artificial-intelligence.healthcaretechoutlook.com/cxoinsights/unstructured-data-in-healthcare-nid-506.html.