Chapter

Role and Challenges of Unstructured Big Data in Healthcare


Abstract

The unprecedented growth in the volume of unstructured healthcare data holds immense potential for valuable insight extraction, improved healthcare services, quality patient care, and secure data management. However, technological advancements are required to realize these potential benefits at a pace that matches this growth. The heterogeneity, diversity of sources, variable quality, and varied representations of unstructured data in healthcare increase the number of challenges compared with structured data. This systematic review of the literature identifies the challenges and problems of data-driven healthcare caused by the unstructured nature of data. The review was carried out using five major scientific databases: ACM, Springer, ScienceDirect, PubMed, and IEEE Xplore. Articles were initially included based on English language and a publication date between 2010 and 2018, and a total of 103 articles were selected according to the inclusion criteria. Based on the review, various types of unstructured healthcare data from different domains of healthcare are discussed. Potential challenges associated with unstructured big data in healthcare are also identified as future research directions for the technological advancement of healthcare services and quality patient care.


... However, these studies are limited to LLM applications for corpus annotation in NER tasks. Although automatic corpus annotation is feasible, it is typically confined to annotating simpler information, such as named entities (Adnan et al., 2019). This ...
... Open-domain EE systems are used to locate and gather data on events originating from general domain sources. The main focus of open-domain EE is to extract the event occurrence and determine the type of event itself (Adnan et al., 2019). Conversely, closed-domain EE systems are used to handle events from a specific domain since they utilize a predetermined event schema for the extraction procedure, such as security, justice, finance, biological activities, and chemical reactions. ...
Article
Full-text available
Purpose The purpose of this study is to serve as a comprehensive review of the existing annotated corpora. This review study aims to provide information on the existing annotated corpora for event extraction, which are limited but essential for training and improving the existing event extraction algorithms. In addition to the primary goal of this study, it provides guidelines for preparing an annotated corpus and suggests suitable tools for the annotation task. Design/methodology/approach This study employs an analytical approach to examine available corpus that is suitable for event extraction tasks. It offers an in-depth analysis of existing event extraction corpora and provides systematic guidelines for researchers to develop accurate, high-quality corpora. This ensures the reliability of the created corpus and its suitability for training machine learning algorithms. Findings Our exploration reveals a scarcity of annotated corpora for event extraction tasks. In particular, the English corpora are mainly focused on the biomedical and general domains. Despite the issue of annotated corpora scarcity, there are several high-quality corpora available and widely used as benchmark datasets. However, access to some of these corpora might be limited owing to closed-access policies or discontinued maintenance after being initially released, rendering them inaccessible owing to broken links. Therefore, this study documents the available corpora for event extraction tasks. Research limitations Our study focuses only on well-known corpora available in English and Chinese. Nevertheless, this study places a strong emphasis on the English corpora due to its status as a global lingua franca, making it widely understood compared to other languages. Practical implications We genuinely believe that this study provides valuable knowledge that can serve as a guiding framework for preparing and accurately annotating events from text corpora. It provides comprehensive guidelines for researchers to improve the quality of corpus annotations, especially for event extraction tasks across various domains. Originality/value This study comprehensively compiled information on the existing annotated corpora for event extraction tasks and provided preparation guidelines.
... Data compatibility has also become a primary concern in the process of data integration and interoperability between two different data architectures [40]. According to Adnan et al. [34], the use of data from two different systems can cause problems in data integration due to diversity in data types and a lack of harmonization between different data. One study proposed the use of middleware to integrate the spatial and non-spatial data in the National Lake Database of Malaysia (MyLake) [40]. ...
... For example, the major barrier in adopting big data in the health care sector is due to data compatibility issues in integrating big data into electronic health records (EHR) [70], [71]. This concern was discussed in studies carried out in the health care sector of Malaysia, United States and Europe [34], [41], [70]. ...
Article
Full-text available
Big data has played an ever-increasing role in various sectors of the economy. Despite the availability of big data technologies, many companies and organizations in Malaysia remain reluctant to adopt them. Numerous studies have been published on big data adoption; however, there is a lack of research focusing on identifying the challenges faced by Malaysian organizations. Therefore, this study applies the technology-organization-environment (TOE) framework to examine the challenges faced by Malaysian organizations with regard to big data adoption. A systematic literature review (SLR) was conducted to examine these challenges. The results show that factors from the technology context are deemed to be the major challenge in big data adoption, followed by organization and environment factors. Furthermore, the insights derived from the TOE framework can help address concerns that hinder big data adoption among organizations in Malaysia. Finally, this study concludes with several recommendations.
... However, there is no unified, standardized definition of digital unstructured data in health research. In the literature, digital unstructured data are often referred to interchangeably as "big data", "digital data", or "unstructured textual data" and described as "high-dimensional", "large-scale", "rich", "multivariate" or "raw" [1,4,6-10]. Digital unstructured data are a valuable source of information that may not be captured in structured data and can complement the knowledge base to enable data enrichment to further inform health research. ...
... Data Quality [18,38,49,50] For observational studies, the checklist DAQCORD [53] might be a useful starting point. 6. Can the consistency of data be secured? ...
Article
Full-text available
Digital data play an increasingly important role in advancing health research and care. However, most digital data in healthcare are in an unstructured and often not readily accessible format for research. Unstructured data are often found in a format that lacks standardization and needs significant preprocessing and feature extraction efforts. This poses challenges when combining such data with other data sources to enhance the existing knowledge base, which we refer to as digital unstructured data enrichment. Overcoming these methodological challenges requires significant resources and may limit the ability to fully leverage their potential for advancing health research and, ultimately, prevention and patient care delivery. While prevalent challenges associated with unstructured data use in health research are widely reported across the literature, a comprehensive interdisciplinary summary of such challenges and possible solutions to facilitate their use in combination with structured data sources is missing. In this study, we report findings from a systematic narrative review on the seven most prevalent challenge areas connected with digital unstructured data enrichment in the fields of cardiology, neurology and mental health, along with possible solutions to address these challenges. Based on these findings, we developed a checklist that follows the standard data flow in health research studies. This checklist aims to provide initial systematic guidance to inform early planning and feasibility assessments for health research studies aiming to combine unstructured data with existing data sources. Overall, the generality of reported unstructured data enrichment methods in the studies included in this review calls for more systematic reporting of such methods to achieve greater reproducibility in future studies.
... Activities can be of different kinds such as sitting, standing, walking, running, eating, going upstairs and downstairs in a home environment. The rapid development of mobile computing, smart sensing and IoT technologies has led to a rich set of health-related data that can be used for various healthcare applications including HAR classifiers [1]. For HAR, sensors such as wireless cameras, accelerometers, gyroscope sensors, wearables and other body sensors are often used. ...
... These sensors provide measurements at a sampling frequency of 115 Hz. The frequency of 115 Hz establishes a sufficient condition for a sample rate that permits a discrete sequence of samples to capture all the information from a continuous-time human activity signal. (In this paper, we use the terms 'edge' and 'client' interchangeably.) ...
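As an aside on the sampling claim above: for human activity signals, whose energy is typically concentrated below roughly 20 Hz (an assumed figure for illustration, not taken from the paper), a 115 Hz sampling rate comfortably satisfies the Nyquist criterion:

```latex
% Nyquist sampling criterion (the 20 Hz signal bandwidth is an illustrative assumption)
f_s \ge 2 f_{\max}, \qquad 115\ \mathrm{Hz} \ge 2 \times 20\ \mathrm{Hz} = 40\ \mathrm{Hz}
```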
Preprint
Full-text available
Human Activity Recognition (HAR) is a challenging problem that still needs to be solved. It will mainly be used for eldercare and healthcare as an assistive technology when combined with other technologies such as the Internet of Things (IoT). HAR can be achieved with the help of sensors, smartphones or images. Deep neural network techniques such as artificial neural networks, convolutional neural networks and recurrent neural networks have been used for HAR, in both centralized and federated settings. However, these techniques have certain limitations: RNNs are hard to parallelize, while CNNs are limited by sequence length and are computationally expensive. In this paper, to address these state-of-the-art challenges, we present a novel inertial-sensor-based one-patch transformer which gives the best of both RNNs and CNNs for human activity recognition. We also design a testbed to collect real-time human activity data, which is then used to train and test the proposed transformer. Through experiments, we show that the proposed transformer outperforms state-of-the-art CNN- and RNN-based classifiers in both federated and centralized settings. Moreover, the proposed transformer is computationally inexpensive, as it uses very few parameters compared with existing state-of-the-art CNN- and RNN-based classifiers. It is thus more suitable for federated learning, as it incurs lower communication and computational costs.
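A minimal sketch of the general idea, assuming a standard transformer encoder over windows of accelerometer/gyroscope readings; the paper's actual "one patch transformer" architecture, channel count, and window length are not reproduced here.

```python
# Hedged sketch: a generic transformer-encoder classifier for inertial-sensor
# windows. Dimensions and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

class HARTransformer(nn.Module):
    def __init__(self, n_channels=6, d_model=64, n_classes=6, seq_len=115):
        super().__init__()
        self.proj = nn.Linear(n_channels, d_model)               # per-timestep embedding
        self.pos = nn.Parameter(torch.zeros(1, seq_len, d_model))  # learned positions
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.cls = nn.Linear(d_model, n_classes)

    def forward(self, x):             # x: (batch, seq_len, channels) accel + gyro
        h = self.encoder(self.proj(x) + self.pos)
        return self.cls(h.mean(dim=1))                            # pool over time

model = HARTransformer()
window = torch.randn(8, 115, 6)       # eight one-second windows sampled at 115 Hz
print(model(window).shape)            # (8, n_classes) activity logits
```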
... Volume, Variety, and Value are important Vs of big data that add more challenges to big data techniques and technologies. The Variety of big data, i.e., the diversity of data types from diverse sources, makes it difficult to process, store, and analyze [31]. Moreover, big data quality is one of the most important and critical paradigms that affects data processing, collection, and analysis [7]. ...
... The proposed model has been designed considering the existing usability dimensions and indicators of data usability. This model is the outcome of an extensive literature review of the field [13], [31], [40], [44]. ...
... As the amount of digital content grows exponentially, it becomes necessary to manage it and implement data mining tools [1]. Records management, laws and regulatory obligations, and primary care have traditionally generated large amounts of data. ...
Article
Full-text available
Large populations make it harder to treat all patients with the allocated doctors, resulting in rising healthcare expenses. Data inconsistency, high dimensionality, and sparseness make analyzing and managing health records difficult. In today's world, accurate health forecasting is vitally important, and big data analysis is critical in predicting future health outcomes and providing the best possible health care, so the medical domain was chosen first for Big Data analysis. Inadequate health data hampers the inspection process. Furthermore, several regional disorders have distinct regional features, making illness outbreak detection difficult. Therefore, adaptive recurrent neural network-long short-term memory (A-RNN-LSTM) and cognitive fuzzy-based spider monkey optimization (CF-SMO) techniques are presented. First, the datasets are gathered from Big Data and split into training and testing sets in the pre-processing stage using normalization techniques for outlier removal, data cleaning, and de-noising. Then, features are extracted from the training dataset using the Principal Component Analysis (PCA) technique, and significant features are selected via a multi-objective ant colony optimization (MACO) approach. After that, the A-RNN-LSTM technique is used for disease prediction, and the CF-SMO technique is employed to enhance the prediction rate. Lastly, the performance of the proposed techniques is analyzed and compared with existing techniques to demonstrate the highest accuracy. Findings are depicted using the MATLAB tool.
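A hedged, generic stand-in for the pipeline this abstract describes (normalization, PCA feature extraction, sequence-model prediction): the A-RNN-LSTM, MACO feature selection, and CF-SMO components are custom to the paper and are replaced here by standard scikit-learn and Keras building blocks on placeholder data.

```python
# Minimal sketch, assuming placeholder data and standard components;
# not a reproduction of the paper's A-RNN-LSTM / CF-SMO method.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from tensorflow.keras import layers, models

X = np.random.rand(500, 40)          # placeholder records (500 patients x 40 features)
y = np.random.randint(0, 2, 500)     # placeholder disease labels

X = StandardScaler().fit_transform(X)         # normalisation / outlier damping
X = PCA(n_components=10).fit_transform(X)     # feature extraction
X = X.reshape(-1, 10, 1)                      # treat components as a short sequence

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

model = models.Sequential([
    layers.Input(shape=(10, 1)),
    layers.LSTM(32),                          # LSTM-based predictor
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X_tr, y_tr, epochs=5, batch_size=32, verbose=0)
print("test accuracy:", model.evaluate(X_te, y_te, verbose=0)[1])
```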
... has been limited by challenges in applying NLP to extract key information, which still need to be overcome [24]-[26]. ...
Article
Full-text available
Pediatric Emergency Department (PED) overcrowding presents a significant global challenge, prompting the need for efficient solutions. This paper introduces the BioBridge framework, a novel approach that applies Natural Language Processing (NLP) to Electronic Medical Records (EMRs) in written free-text form to enhance decision-making in the PED. In non-English-speaking countries, such as South Korea, EMR data is often written in a Code-Switching (CS) format that mixes the native language with English, with most code-switched English words having clinical significance. The BioBridge framework consists of two core modules: “bridging modality in context” and “unified bio-embedding.” The “bridging modality in context” module improves the contextual understanding of bilingual and code-switched EMRs. In the “unified bio-embedding” module, the knowledge of the model trained in the medical domain is injected into the encoder-based model to bridge the gap between the medical and general domains. Experimental results demonstrate that the proposed BioBridge significantly outperforms traditional machine learning and pre-trained encoder-based models on several metrics, including F1 score, area under the receiver operating characteristic curve (AUROC), area under the precision-recall curve (AUPRC), and Brier score. Specifically, BioBridge-XLM achieved enhancements of 0.85% in F1 score, 0.75% in AUROC, and 0.76% in AUPRC, along with a notable 3.04% decrease in the Brier score, demonstrating marked improvements in accuracy, reliability, and prediction calibration over the baseline XLM model. The code can be found at https://github.com/jjy961228/BioBridge.
... the percentage of medical terms among English words and the percentage of English words among all words (Eq. 10). This can be expressed mathematically as: ...
Article
Full-text available
The use of pre-trained language models fine-tuned to address specific downstream tasks is a common approach in natural language processing (NLP). However, acquiring domain-specific knowledge via fine-tuning is challenging. Traditional methods involve pretraining language models using vast amounts of domain-specific data before fine-tuning for particular tasks. This study investigates emergency/non-emergency classification tasks based on electronic medical record (EMR) data obtained from pediatric emergency departments (PEDs) in Korea. Our findings reveal that existing domain-specific pre-trained language models underperform compared to general language models in handling N-lingual free-text data characteristic of non-English-speaking regions. To address these limitations, we propose a domain knowledge transfer methodology that leverages knowledge distillation to infuse general language models with domain-specific knowledge via fine-tuning. This study demonstrates the effective transfer of specialized knowledge between models by defining a general language model as the student model and a domain-specific pre-trained model as the teacher model. In particular, we address the complexities of EMR data obtained from PEDs in non-English-speaking regions, such as Korea, and demonstrate that the proposed method enhances classification performance in such contexts. The proposed methodology not only outperforms baseline models on Korean PED EMR data, but also promises broader applicability in various professional and technical domains. In future work, we intend to extend this methodology to include diverse non-English-speaking regions and address additional downstream tasks, with the aim of developing advanced model architectures using state-of-the-art KD techniques. The code is available at https://github.com/JoSangYeon/DSG-KD.
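For context, a standard knowledge-distillation objective of the kind this abstract refers to, with the domain-specific pre-trained model as teacher and the general language model as student, might look as follows; the paper's exact loss formulation may differ.

```latex
% Standard distillation loss (Hinton-style); an illustrative form, not the paper's exact objective.
\mathcal{L}_{\mathrm{KD}} \;=\; \alpha\,\mathrm{CE}\bigl(y,\ \sigma(z_s)\bigr)
\;+\; (1-\alpha)\,T^{2}\,\mathrm{KL}\bigl(\sigma(z_t/T)\ \big\|\ \sigma(z_s/T)\bigr)
```

Here z_s and z_t are the student and teacher logits, σ is the softmax, T is the distillation temperature, and α weights the hard-label cross-entropy against the soft-label KL term.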
... These findings indicate that the information extracted from text notes in EMR is under-utilized and still worth investigating. As a result, effectively exploiting and leveraging unstructured medical data has great potential in benefiting healthcare analytics (Adnan et al., 2020a). ...
Preprint
Full-text available
Most of the existing medication recommendation models are predicted with only structured data such as medical codes, with the remaining other large amount of unstructured or semi-structured data underutilization. To increase the utilization effectively, we proposed a method of enhancing medication recommendation with Large Language Model (LLM) text representation. LLM harnesses powerful language understanding and generation capabilities, enabling the extraction of information from complex and lengthy unstructured data such as clinical notes which contain complex terminology. This method can be applied to several existing base models we selected and improve medication recommendation performance with the combination representation of text and medical codes experiments on two different datasets. LLM text representation alone can even demonstrate a comparable ability to the medical code representation alone. Overall, this is a general method that can be applied to other models for improved recommendations.
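A hedged sketch of the fusion idea described above: encode the clinical-note text with a pretrained encoder, embed the structured medical codes, and combine the two representations for multi-label medication scoring. The model name, dimensions, and example inputs are assumptions for illustration only, not the paper's setup.

```python
# Minimal sketch, assuming a generic encoder and invented code/drug vocabularies.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

class NoteCodeFusion(nn.Module):
    def __init__(self, encoder_name="bert-base-uncased", n_codes=2000, n_drugs=150):
        super().__init__()
        self.tok = AutoTokenizer.from_pretrained(encoder_name)
        self.encoder = AutoModel.from_pretrained(encoder_name)
        self.code_emb = nn.EmbeddingBag(n_codes, 128)           # bag of medical codes
        self.head = nn.Linear(self.encoder.config.hidden_size + 128, n_drugs)

    def forward(self, notes, code_ids):
        batch = self.tok(notes, padding=True, truncation=True, return_tensors="pt")
        text_vec = self.encoder(**batch).last_hidden_state[:, 0]   # [CLS] representation
        code_vec = self.code_emb(code_ids)
        return torch.sigmoid(self.head(torch.cat([text_vec, code_vec], dim=-1)))

model = NoteCodeFusion()
scores = model(["Patient with type 2 diabetes and hypertension."], torch.tensor([[10, 42, 7]]))
print(scores.shape)   # (1, n_drugs) multi-label medication scores
```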
... Data pre-processing. Effective data analysis in a medical context hinges upon the quality and coherence of the underlying data [7]. Pre-processing steps were implemented to ensure the dataset's reliability, interpretability, and clinical relevance by renaming, recoding, transforming variables, and handling missing values. ...
Article
Full-text available
Background The global evolution of pre-hospital care systems faces dynamic challenges, particularly in multinational settings. Machine learning (ML) techniques enable the exploration of deeply embedded data patterns for improved patient care and resource optimisation. This study’s objective was to accurately predict cases that necessitated transportation versus those that did not, using ML techniques, thereby facilitating efficient resource allocation. Methods ML algorithms were utilised to predict patient transport decisions in a Middle Eastern national pre-hospital emergency medical care provider. A comprehensive dataset comprising 93,712 emergency calls from the 999-call centre was analysed using R programming language. Demographic and clinical variables were incorporated to enhance predictive accuracy. Random Forest (RF), Support Vector Machine (SVM), Extreme Gradient Boosting (XGBoost), and Adaptive Boosting (AdaBoost) algorithms were trained and validated. Results All the trained algorithm models, particularly XGBoost (Accuracy = 83.1%), correctly predicted patients’ transportation decisions. Further, they indicated statistically significant patterns that could be leveraged for targeted resource deployment. Moreover, the specificity rates were high; 97.96% in RF and 95.39% in XGBoost, minimising the incidence of incorrectly identified “Transported” cases (False Positive). Conclusion The study identified the transformative potential of ML algorithms in enhancing the quality of pre-hospital care in Qatar. The high predictive accuracy of the employed models suggested actionable avenues for day and time-specific resource planning and patient triaging, thereby having potential to contribute to pre-hospital quality, safety, and value improvement. These findings pave the way for more nuanced, data-driven quality improvement interventions with significant implications for future operational strategies.
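The study itself was implemented in R; as a hedged illustration only, the following Python sketch shows one of the reported model families (a random forest) trained on synthetic stand-in features, together with the specificity computation highlighted in the results.

```python
# Hedged sketch on synthetic data; not the study's R implementation or dataset.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

rng = np.random.default_rng(0)
X = rng.random((5000, 12))                                   # placeholder demographic/clinical features
y = (X[:, 0] + 0.3 * rng.random(5000) > 0.6).astype(int)     # 1 = transported, 0 = not transported

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
pred = clf.predict(X_te)

tn, fp, fn, tp = confusion_matrix(y_te, pred).ravel()
print("accuracy   :", accuracy_score(y_te, pred))
print("specificity:", tn / (tn + fp))    # rate of correctly identified "Not transported" cases
```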
... As healthcare data volume continues to expand exponentially, driven by electronic health records, genomics, and various forms of health-related information, the potential to harness these vast datasets for predictive analytics and personalized medicine is significant [1], [2], [3]. However, the utilization of big data in healthcare is not without challenges, particularly concerning data security and privacy [4], [5], [6]. This review explores the dual facets of big data in healthcare, examining the profound opportunities it presents for patient care and the serious security concerns accompanying its use. ...
... On the other hand, unstructured data, lacking a predefined format, offers a diverse range of information crucial for creating inclusive and sustainable talent management strategies [52]. Such data, including information from video interviews and social media, is vital for cultivating a diverse and inclusive workforce, a key element in sustainable organizational development [53]. ...
Article
Full-text available
This study elucidates the transformative influence of data integration on talent management in the context of evolving technological paradigms, with a specific focus on sustainable practices in human resources. Historically anchored in societal norms and organizational culture, talent management has transitioned from traditional methodologies to harnessing diverse data sources, a shift that enhances sustainable HR strategies. By employing a narrative literature review, the research traces the trajectory of HR data sources, emphasizing the juxtaposition of structured and unstructured data. The digital transformation of HR is explored, not only highlighting the evolution of Human Resource Information Systems (HRIS) but also underscoring their role in promoting sustainable workforce management. The integration of advanced technologies such as machine learning and natural language processing is examined, reflecting on their impact on the efficiency and ecological aspects of HR practices. This paper not only underscores the imperative of balancing data-driven strategies with the quintessential human element of HR but also provides concrete examples demonstrating this balance in action for practitioners and scholars in sustainable human resources.
... Choi et al. (2020) pointed out that media literacy enables people to be "critical thinkers" and "creative producers" of messages in images, speech, and sound. Media literacy is the basic competence of a modern citizen who should understand the different types and signs of multimedia (Adnan et al., 2020) and have the ability to receive messages purposefully, such as identifying information, understanding message elements, evaluating messages, responding to a message, and recognizing the effect of multimedia (Chen et al., 2021). In a recent study, Hayes (2021) suggested the level of cognition as mental processing and thinking; people with better literacy skills would go beyond surface information and pay attention to the depth of the content. ...
Article
Participation in social networking sites offers many potential benefits for university students. Online interaction on these sites provides various opportunities for them to learn and improve self-control, tolerate and respect the viewpoints of others, express emotions in healthy and orderly ways, and think and make decisions critically. These sites also provide them with a virtual space to spend time, form close connections with friends without being spatially restricted, and pursue self-development. However, the number of studies examining university students’ social networking sites, media literacy, and critical thinking is very limited in the literature. Therefore, this research examined the effects of motivation to use social networking sites on students’ media literacy and critical thinking. The research also examined the relationships between students’ motivation for using social networks, media literacy, and critical thinking. The data were collected using three data collection instruments. The participants were 211 university students enrolled at two universities in Bangkok, Thailand. The results showed significant positive correlations between motivation to use social networking sites and critical thinking, indicating that university students with better performance in information and learning show better performance in critical thinking and reflection skills. The results also showed remarkable positive correlations between motivation for using social networking sites and media literacy, indicating that university students with better performance in information and learning show better performance in multimedia messages and multimedia organization and analysis. In addition, the results also revealed positive correlations between critical thinking and media literacy. Implications are drawn based on the results obtained from this research.
... Unstructured notes refer to the ability to enter freely written notes into the EHR [22]-[24], [43], [45], [47], [49], [58], [61]; only eight dashboards enabled this option. This type of clinical data poses many challenges because of the slow process of insight extraction [6], [75], [76]; however, it is an important source of information that can contain elements worth representing in EHR visualisations. Our analysis suggests that unstructured notes are often excluded in these systems, possibly because of their qualitative nature, which does not fit well with commonly used graphical elements or visualisation techniques. ...
Conference Paper
Visualisations in Electronic Health Records (EHRs) are crucial for clinical care. Since clinicians need to quickly diagnose and treat their patients, having appropriate ways to visualise patients’ characteristics and issues documented in the EHR can be instrumental. However, the existing literature has not yet summarised the characteristics and lessons learned from studies on patient dashboards for clinical care. Our review analysed patient dashboards that visualised EHR data to support clinical care and were evaluated with end-users. We read papers from Human-Computer Interaction, Information Visualisation, and Medical Informatics, focusing on the user interfaces and the end-user evaluation results. From a set of 3545 articles, we selected 30 studies, which were analysed using Thematic Analysis. Results provide an understanding of the patient dashboard designs, the visualisation techniques employed, and the data represented, as well as the lessons learned from this body of work, which should contribute to future designs.
... The obtained data are typically stored in widely dispersed data repositories, often in various formats with different data owners. In both human and veterinary medicine, there are four key challenges impeding consistent and seamless data evaluation [4,48]: 1. The heterogeneity of the data: This necessitates additional processing steps before a comparative analysis. ...
Article
Full-text available
Simple Summary Technological and social progress are often closely linked. There is a consensus that future medicine will benefit from the use of available health data. The form in which these health data will be available is crucial. For a comprehensive analysis, it is necessary to harmonize and link these data and enable seamless use across system boundaries. In this paper, we consider the field of equine veterinary health data. We propose a vision that data from the entire global horse population are utilized for the benefit of animal health and longevity. With this in mind, we examine social aspects influencing technical progress. Here, we use a socio-technical matrix as a tool. We reduce the overall complexity by limiting ourselves to the following: Technically, we consider the treasure trove of data from veterinary diagnostics and the Internet of Medical Things (IoMT). Regarding social interactions, we focus on veterinarians and horse owners. Utilizing this socio-technical matrix, we branch out to identify barriers and enablers on the way to the vision. Additional elements, such as the slowly maturing awareness of horse owners regarding the value of these data as well as training of all parties in the handling of data, are identified to be crucial. Abstract There is a consensus that future medicine will benefit from a comprehensive analysis of harmonized, interconnected, and interoperable health data. These data can originate from a variety of sources. In particular, data from veterinary diagnostics and the monitoring of health-related life parameters using the Internet of Medical Things are considered here. To foster the usage of collected data in this way, not only do technical aspects need to be addressed but so do organizational ones, and to this end, a socio-technical matrix is first presented that complements the literature. It is used in an exemplary analysis of the system. Such a socio-technical matrix is an interesting tool for analyzing the process of data sharing between actors in the system dependent on their social relations. With the help of such a socio-technical tool and using equine veterinary medicine as an example, the social system of veterinarians and owners as actors is explored in terms of barriers and enablers of an effective digital representation of the global equine population.
... Recently, unstructured data has grown exponentially, and most healthcare data are unstructured, such as medical prescriptions and electronic medical records. One of the barriers to using unstructured data is the difficulty of extracting useful information from it, even though it can provide novel insights into healthcare systems (Adnan et al., 2020). One emerging research area is applying text mining to analyze textual data. ...
Article
Full-text available
Nowadays, many countries view profitable telemedicine as a viable strategy for meeting healthcare needs, especially during the pandemic. Existing appointment models are based on patients’ structured data. We study the value of incorporating textual patient data into telemedicine appointment optimization. Our research contributes to the healthcare operations management literature by developing a new framework showing (1) the value of the text in the telemedicine appointment problem, (2) the value of incorporating the textual and structured data in the problem. In particular, in the first phase of the framework, a text-driven classification model is developed to classify patients into normal and prolonged service time classes. In the second phase, we integrate the classification model into two existing decision-making policies. We analyze the performance of our proposed policy in the presence of existing methods on a data set from the National Telemedicine Center of China (NTCC). We first show that our classifier can achieve 90.4% AUC in a binary task based on textual data. We next show that our method outperforms the stochastic model available in the literature. In particular, with a slight change of actual distribution from historical data to a normal distribution, we observe that our policy improves the average profit of the policy obtained from the stochastic model by 42% and obtains lower relative regret (18%) from full information than the stochastic model (148%). Furthermore, our policy provides a promising trade-off between the cancellation and postponement rates of patients, resulting in a higher profit and a better schedule strategy for the telemedicine center.
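A minimal sketch of the first-phase idea above, assuming a simple TF-IDF plus logistic-regression classifier over appointment request text; the NTCC data, features, and the actual text-driven model are not reproduced here.

```python
# Hedged sketch with invented example texts; labels a request as normal (0)
# vs. prolonged (1) expected service time.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = [
    "follow-up question about prescription refill",
    "multiple chronic conditions, needs detailed consultation and imaging review",
    "simple lab result explanation",
    "complex history, second opinion on treatment plan",
]
labels = [0, 1, 0, 1]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict_proba(["long-standing symptoms, several specialists involved"]))
```

The predicted class (normal vs. prolonged) would then feed the second-phase scheduling policy as described in the abstract.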
... The potential of NLP in the analysis of EHR data is particularly appealing given the great quantity of data contained in these records. Notwithstanding their importance, such data are intractable with conventional mathematical methods, since they are recorded in clinical reports, prescriptions, annotations on medical images, and generally unstructured texts [8]. ...
Article
Full-text available
Background Recent advances in natural language processing (NLP) have heightened the interest of the medical community in its application to health care in general, in particular to stroke, a medical emergency of great impact. In this rapidly evolving context, it is necessary to learn and understand the experience already accumulated by the medical and scientific community. Objective The aim of this scoping review was to explore the studies conducted in the last 10 years using NLP to assist the management of stroke emergencies so as to gain insight on the state of the art, its main contexts of application, and the software tools that are used. Methods Data were extracted from Scopus and Medline through PubMed, using the keywords “natural language processing” and “stroke.” Primary research questions were related to the phases, contexts, and types of textual data used in the studies. Secondary research questions were related to the numerical and statistical methods and the software used to process the data. The extracted data were structured in tables and their relative frequencies were calculated. The relationships between categories were analyzed through multiple correspondence analysis. Results Twenty-nine papers were included in the review, with the majority being cohort studies of ischemic stroke published in the last 2 years. The majority of papers focused on the use of NLP to assist in the diagnostic phase, followed by the outcome prognosis, using text data from diagnostic reports and in many cases annotations on medical images. The most frequent approach was based on general machine learning techniques applied to the results of relatively simple NLP methods with the support of ontologies and standard vocabularies. Although smaller in number, there has been an increasing body of studies using deep learning techniques on numerical and vectorized representations of the texts obtained with more sophisticated NLP tools. Conclusions Studies focused on NLP applied to stroke show specific trends that can be compared to the more general application of artificial intelligence to stroke. The purpose of using NLP is often to improve processes in a clinical context rather than to assist in the rehabilitation process. The state of the art in NLP is represented by deep learning architectures, among which Bidirectional Encoder Representations from Transformers has been found to be especially widely used in the medical field in general, and for stroke in particular, with an increasing focus on the processing of annotations on medical images.
... As a significant component of EMR, clinical notes, which record patients' conditions in free text, provide essential information such as patients' medical history, social history, or lifestyle patterns. Despite being a vital data source, its practical use in medical decision support systems is hampered by challenges in extracting key information from its unstructured text format [6][7][8][9]. For medical institutions in non-English speaking countries, these challenges are compounded by the use of both English and local languages in their clinical notes. ...
Article
Full-text available
As a key modifiable risk factor, alcohol consumption is clinically crucial information that allows medical professionals to further understand their patients’ medical conditions and suggest appropriate lifestyle-modifying interventions. However, identifying alcohol-related information from unstructured free-text clinical notes is often challenging. Not only are the formats of the notes inconsistent, but they also include a massive amount of non-alcohol-related information. Furthermore, for medical institutions outside of English-speaking countries, these clinical notes contain a mixture of English and local languages, inducing additional difficulty in the extraction. Thanks to the increasing availability of electronic medical records (EMRs), several previous works explored the idea of using natural language processing (NLP) to train machine learning models that automatically identify alcohol-related information from unstructured clinical notes. However, all these previous works are limited to English clinical notes and were thereby able to leverage various large-scale external ontologies during text preprocessing. Furthermore, they rely on simple NLP techniques such as bag-of-words models that suffer from high dimensionality and out-of-vocabulary issues. To address these issues, we fine-tune multilingual transformers. By leveraging the linguistically rich contextual information learned during their pre-training, we are able to extract alcohol-related information from unstructured clinical notes without preprocessing the clinical notes with any external ontologies. Furthermore, our work is the first to explore the use of transformers in bilingual clinical notes to extract alcohol-related information. Even with minimal text preprocessing, we achieve an extraction accuracy of 84.70% in terms of macro F-1 score.
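A hedged sketch of the described setup: fine-tuning a multilingual encoder (XLM-RoBERTa is assumed here) as a binary classifier over short bilingual notes; the example notes, labels, and hyperparameters are illustrative only.

```python
# Minimal fine-tuning sketch with invented bilingual examples; not the study's data or model.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "xlm-roberta-base"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

notes = ["환자 social drinker, soju 1 bottle/week",     # mixed Korean/English note
         "No history of alcohol use 부인함"]
labels = torch.tensor([1, 0])                            # 1 = alcohol-related mention

batch = tok(notes, padding=True, truncation=True, return_tensors="pt")
optim = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(3):                                       # tiny illustrative training loop
    out = model(**batch, labels=labels)
    out.loss.backward()
    optim.step()
    optim.zero_grad()

print(torch.softmax(model(**batch).logits, dim=-1))      # per-note class probabilities
```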
... Unstructured data does not follow any format or formal data model, which makes it challenging to process and interpret for value extraction. In the medical industry, unstructured healthcare data has tremendous potential to extract valuable insight for improving healthcare service and quality [15]. ...
Article
Full-text available
Multiple blood images of stressed and sheared cells, taken by a Lorrca Ektacytometery microscope, needed a classification for biomedical researchers to assess several treatment options for blood-related diseases. The study proposes the design of a model capable of classifying these images, with high accuracy, into healthy Red Blood Cells (RBCs) or Sickle Cells (SCs) images. The performances of five Deep Learning (DL) models with two different optimizers, namely Adam and Stochastic Gradient Descent (SGD), were compared. The first three models consisted of 1, 2 and 3 blocks of CNN, respectively, and the last two models used a transfer learning approach to extract features. The dataset was first augmented, scaled, and then trained to develop models. The performance of the models was evaluated by testing on new images and was illustrated by confusion matrices, performance metrics (accuracy, recall, precision and f1 score), a receiver operating characteristic (ROC) curve and the area under the curve (AUC) value. The first, second and third models with the Adam optimizer could not achieve training, validation or testing accuracy above 50%. However, the second and third models with SGD optimizers showed good loss and accuracy scores during training and validation, but the testing accuracy did not exceed 51%. The fourth and fifth models used VGG16 and Resnet50 pre-trained models for feature extraction, respectively. VGG16 performed better than Resnet50, scoring 98% accuracy and an AUC of 0.98 with both optimizers. The study suggests that transfer learning with the VGG16 model helped to extract features from images for the classification of healthy RBCs and SCs, thus making a significant difference in performance comparing the first, second, third and fifth models.
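A minimal sketch of the transfer-learning configuration described above, using a frozen ImageNet-pretrained VGG16 as feature extractor with a small binary head; the image size, head layout, and training details are assumptions rather than the study's exact setup.

```python
# Hedged sketch: VGG16 feature extraction + small classification head for RBC vs. sickle cell.
import tensorflow as tf
from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, models

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False                      # use VGG16 purely as a feature extractor

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # 1 = sickle cell, 0 = healthy RBC
])
model.compile(optimizer="adam",             # the study compared Adam and SGD optimizers
              loss="binary_crossentropy",
              metrics=["accuracy", tf.keras.metrics.AUC()])
model.summary()
# model.fit(train_images, train_labels, validation_data=..., epochs=...)
```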
... are not yet systematically available or integrated [110] and unstructured data can be difficult to access due to privacy concerns [111,112]. Section 1.2.1 provides a deeper discussion of the challenges of using EHR data for research. [88]. ...
Thesis
Full-text available
Traditional computational phenotypes (CPs) identify patient cohorts without consideration of underlying pathophysiological mechanisms. Deeper patient-level characterizations are necessary for personalized medicine and while advanced methods exist, their application in clinical settings remains largely unrealized. This thesis advances deep CPs through several experiments designed to address four requirements. Stability was examined through three experiments. First, a multiphase study was performed and identified resources and remediation plans as barriers preventing data quality (DQ) assessment. Then, through two experiments, the Harmonized DQ Framework was used to characterize DQ checks from six clinical organizations and 12 biomedical ontologies finding Atemporal Plausibility and Completeness and Value Conformance as the most common clinical checks and Value and Relation Conformance as the most common biomedical ontology checks. Scalability was examined through three experiments. First, a novel composite patient similarity algorithm was developed that demonstrated that information from clinical terminology hierarchies improved patient representations when applied to small populations. Then, ablation studies were performed and showed that the combination of data type, sampling window, and clinical domain used to characterize rare disease patients differed by disease. Finally, an algorithm that losslessly transforms complex knowledge graphs (KGs) into representations more suitable for inductive inference was developed and validated through the generation of expert-verified plausible novel drug candidates. Interoperability was examined through two experiments. First, 36 strategies to align five eMERGE CPs to standard clinical terminologies were examined and revealed lower false negative and positive counts in adults than in pediatric patient populations. Then, hospital-scale mappings between clinical terminologies and biomedical ontologies were developed and found to be accurate, generalizable, and logically consistent. Multimodality was examined through two experiments. A novel ecosystem for constructing ontologically-grounded KGs under alternative knowledge models using different relation strategies and abstraction strategies was created. The resulting KGs were validated through successfully enriching portions of the preeclampsia molecular signature with no previously known literature associations. These experiments were used to develop a joint learning framework for inferring molecular characterizations of patients from clinical data. The utility of this framework was demonstrated through the accurate inference of EHR-derived rare disease patient genotypes/phenotypes from publicly available molecular data.
... In the literature, unstructured data are often referred to interchangeably as "big data", "digital data", "unstructured textual data" and described as "high-dimensional", "large-scale", "rich", "multivariate" or "raw". 1,3,21,25,26,28 Unstructured data can be utilized on their own or combined with other data sources to enable data enrichment in health research. In this context, we use digital unstructured data enrichment to describe the process of augmenting the available evidence base in health research, which mostly consists of structured data, with unstructured data. ...
Preprint
Digital data play an increasingly important role in advancing medical research and care. However, most digital data in healthcare are in an unstructured and often not readily accessible format for research. Specifically, unstructured data are available in a non-standardized format and require substantial preprocessing and feature extraction to translate them to meaningful insights. This might hinder their potential to advance health research, prevention, and patient care delivery, as these processes are resource intensive and connected with unresolved challenges. These challenges might prevent enrichment of structured evidence bases with relevant unstructured data, which we refer to as digital unstructured data enrichment. While prevalent challenges associated with unstructured data in health research are widely reported across literature, a comprehensive interdisciplinary summary of such challenges and possible solutions to facilitate their use in combination with existing data sources is missing. In this study, we report findings from a systematic narrative review on the seven most prevalent challenge areas connected with the digital unstructured data enrichment in the fields of cardiology, neurology and mental health along with possible solutions to address these challenges. Building on these findings, we compiled a checklist following the standard data flow in a research study to contribute to the limited available systematic guidance on digital unstructured data enrichment. This proposed checklist offers support in early planning and feasibility assessments for health research combining unstructured data with existing data sources. Finally, the sparsity and heterogeneity of unstructured data enrichment methods in our review call for a more systematic reporting of such methods to achieve greater reproducibility.
... Estimates indicate that health care data will soon attain the levels of zettabytes and even yottabytes (Glick 2015). Almost 90% of universal data and 60% of medical data are still unstructured and text based (Malmasi et al. 2017;Adnan et al. 2020). It is fundamental to understand that data are useless when they cannot be read, retrieved, analyzed, deciphered, and reused (Obermeyer and Emanuel 2016). ...
Article
Full-text available
Information has become the vital commodity of exchange in recent decades. Medicine is no exception; the importance of patient information in the digital form has been recognized by organizations and health care facilities. Almost all patient information, including medical history, radiographs, and feedback, can be digitally recorded synchronously and asynchronously. Nevertheless, patient information that could be shared and reused to enhance care delivery is not readily available in a format that could be understood by the systems in recipient health care facilities. The systems used in medical and dental clinics today lack the ability to communicate with each other. The critical information is stagnant in isolated silos, unable to be shared, analyzed, and reused. In this article, we propose enabling interoperability in health care systems that could facilitate communication across systems for the benefit of patients and caregivers. We explain in this article the importance of interoperable data, the international interoperability standards available, and the range of benefits and opportunities that interoperability can create in dentistry for providers and patients alike.
... Transformation of such data into structured data requires substantial cleaning, splitting, merging, validating, and sorting, but does improve clinical representation in predictive analytics. 109 Finally, algorithms are not anticipated to completely replace the "subjective" judgment of clinicians involved in the care of the peritransplant patient. 110 For instance, significant technical expertise is required to conduct split liver transplantation, 111 to use donor organs with technical variants or higher risk features, 112 or to successfully transplant patients with complex surgical histories. ...
Article
In this review article, we discuss the model for end-stage liver disease (MELD) score and its dual purpose in general and transplant hepatology. As the landscape of liver disease and transplantation has evolved considerably since the advent of the MELD score, we summarise emerging concepts, methodologies, and technologies that may improve mortality prognostication in the future. Finally, we explore how these novel concepts and technologies may be incorporated into clinical practice.
... Structuring and managing symptom information is a major challenge for research owing to their complex and multidimensional nature. Extracting symptom information from clinical text is critical; for example, for phenotypic classification, clinical diagnosis, or clinical decision support [1][2][3]. More specifically, symptoms are crucial to the assessment and monitoring of the general state of the patient [1,4] and are critical indicators of quality of life for chronically ill patients [5,6]. ...
Article
Full-text available
Background: Automated extraction of symptoms from clinical notes is a challenging task owing to the multidimensional nature of symptom description. The availability of labeled training data is extremely limited owing to the nature of the data containing protected health information. Natural language processing and machine learning to process clinical text for such a task have great potential. However, supervised machine learning requires a great amount of labeled data to train a model, which is at the origin of the main bottleneck in model development. Objective: The aim of this study is to address the lack of labeled data by proposing 2 alternatives to manual labeling for the generation of training labels for supervised machine learning with English clinical text. We aim to demonstrate that using lower-quality labels for training leads to good classification results. Methods: We addressed the lack of labels with 2 strategies. The first approach took advantage of the structured part of electronic health records and used diagnosis codes (International Classification of Disease-10th revision) to derive training labels. The second approach used weak supervision and data programming principles to derive training labels. We propose to apply the developed framework to the extraction of symptom information from outpatient visit progress notes of patients with cardiovascular diseases. Results: We used >500,000 notes for training our classification model with International Classification of Disease-10th revision codes as labels and >800,000 notes for training using labels derived from weak supervision. We show that the dependence between prevalence and recall becomes flat provided a sufficiently large training set is used (>500,000 documents). We further demonstrate that using weak labels for training rather than the electronic health record codes derived from the patient encounter leads to an overall improved recall score (10% improvement, on average). Finally, the external validation of our models shows excellent predictive performance and transferability, with an overall increase of 20% in the recall score. Conclusions: This work demonstrates the power of using a weak labeling pipeline to annotate and extract symptom mentions in clinical text, with the prospects to facilitate symptom information integration for a downstream clinical task such as clinical decision support.
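A toy illustration of the weak-supervision idea described above: a few labeling functions (keyword, negation, and ICD-10-derived rules, all invented here for illustration) vote on each note, and the resulting noisy labels would then be used to train the downstream symptom classifier.

```python
# Hedged sketch of data-programming-style weak labeling; the rules and the
# ICD-10 prefix are illustrative assumptions, not the study's labeling functions.
ABSTAIN, NEG, POS = -1, 0, 1

def lf_keyword(note):                 # keyword rule
    return POS if any(w in note.lower() for w in ("dyspnea", "shortness of breath")) else ABSTAIN

def lf_negation(note):                # crude negation rule
    return NEG if "denies shortness of breath" in note.lower() else ABSTAIN

def lf_icd_code(codes):               # label derived from structured ICD-10 codes
    return POS if any(c.startswith("R06") for c in codes) else ABSTAIN

def weak_label(note, codes):
    votes = [v for v in (lf_keyword(note), lf_negation(note), lf_icd_code(codes)) if v != ABSTAIN]
    if not votes:
        return ABSTAIN                # unlabeled; dropped from training
    return POS if votes.count(POS) > votes.count(NEG) else NEG   # negation wins ties

print(weak_label("Patient reports shortness of breath on exertion.", ["I10"]))   # -> 1
print(weak_label("Denies shortness of breath.", []))                              # -> 0
```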
... Understanding this data can improve the quality of social activities e.g. the international situation awareness, conflict resolution, noticing possible crises, future policy planning, etc. [2]. Nevertheless, most of the aforementioned data is unstructured text, thus hard to analyze [3,4]. On the other hand, dealing with such a large amount of data is not easy, making it impractical to derive valuable information by manual reading and analysis [2]. ...
Preprint
Full-text available
Data is published on the web over time in great volumes, but the majority of the data is unstructured, making it hard to understand and difficult to interpret. Information Extraction (IE) methods extract structured information from unstructured data. One of the challenging IE tasks is Event Extraction (EE), which seeks to derive information about specific incidents and their actors from the text. EE is useful in many domains such as building a knowledge base, information retrieval, summarization and online monitoring systems. In the past decades, some event ontologies like ACE, CAMEO and ICEWS were developed to define event forms, actors and dimensions of events observed in the text. These event ontologies still have some shortcomings, such as covering only a few topics like political events, having an inflexible structure in defining argument roles, a lack of analytical dimensions, and complexity in choosing event sub-types. To address these concerns, we propose an event ontology, namely COfEE, that incorporates expert domain knowledge, previous ontologies and a data-driven approach for identifying events from text. COfEE consists of two hierarchy levels (event types and event sub-types) that include new categories relating to environmental issues, cyberspace, criminal activity and natural disasters which need to be monitored instantly. Also, dynamic roles according to each event sub-type are defined to capture various dimensions of events. In a follow-up experiment, the proposed ontology is evaluated on Wikipedia events, and it is shown to be general and comprehensive. Moreover, in order to facilitate the preparation of gold-standard data for event extraction, a language-independent online tool is presented based on COfEE.
... However, there are a number of domain-specific challenges encountered in human healthcare disciplines that are unique to these data frameworks. One example of this is the use of unstructured data, such as patient notes and interpretations of diagnostic tests, which contain rich information that can provide valuable insights at both the individual patient and population levels, but the heterogeneity, variability, and diversity of these data make them difficult to access and analyze in a controlled manner [3]. Another challenge lies in the issues of privacy and security in human healthcare, which have drawn significant attention in recent years, but are especially important in healthcare settings because of concerns related to the introduction of the HIPAA Privacy Act, which declared medical information, including electronic medical records, to be protected health information covered under the Privacy Rule [4,5]. ...
Article
Full-text available
Simple Summary: Big data has created many opportunities to improve both preventive medicine and medical treatments. In the field of veterinary medical big data, information collected from companion animals, primarily dogs, can be used to inform healthcare decisions in both dogs and other species. Currently, veterinary medical datasets are an underused resource for translational research, but recent advances in data collection in this population have helped to make these data more accessible for use in translational studies. The largest open access dataset in the United States is part of the Dog Aging Project and includes detailed information about individual dog participant's physical and chemical environments, diet, exercise, behavior, and comprehensive health history. These data are collected longitudinally and at regular intervals over the course of the dog's lifespan. Large-scale datasets such as this can be used to inform our understanding of health, disease, and how to increase healthy lifespan. Abstract: Dogs provide an ideal model for study as they have the most phenotypic diversity and known naturally occurring diseases of all non-human land mammals. Thus, data related to dog health present many opportunities to discover insights into health and disease outcomes. Here, we describe several sources of veterinary medical big data that can be used in research. These sources include medical records from primary medical care centers or referral hospitals, medical claims data from animal insurance companies, and datasets constructed specifically for research purposes. No data source provides information that is without limitations, but large-scale, prospective, longitudinally collected data from dog populations are ideal for further research as they offer many advantages over other data sources.
... Different activities such as schema mapping, format transformation, data synthesis, and attribute-level data handling are required to understand the context of data. Context-based keyword identification [69], understanding data and its context [77], [23], [79], [63], [70], [24], identification of design-level contextual features [78], context awareness for data relations [52], [53], resolving language syncretism issues [17], and identification of contextual weaknesses of unstructured data [62] are contextual determinants that highlight the importance of data context. The user's requirements (functional and non-functional), priorities, and declarative description define the user context involved in the analysis strategy [94]. ...
Article
Full-text available
Unstructured text contains valuable information for a range of enterprise applications and informed decision making. Text analytics is used to extract valuable insights from unstructured big data. Among the most significant challenges of text analytics, quality and usability are critical because they affect the outcome of the analytical process. Enhancing usability is important for the exploitation of unstructured data. Most of the existing literature focuses on the usability of structured data rather than unstructured data, whereas big data usability has been discussed merely in the context of its assessment. The existing approaches do not provide proper guidelines on usability enhancement of unstructured data. In this study, a rigorous systematic literature review, using the PRISMA framework, has been conducted to develop a model enhancing the usability of unstructured data and bridging the research gap. Recent approaches and solutions for text analytics have been investigated thoroughly. Furthermore, the study identifies the usability issues of unstructured text data and their consequences for data preparation for analytics. Defining the usability dimensions for unstructured big data, identifying the usability determinants, and developing the relationship between usability dimensions and determinants to derive usability rules are the significant contributions of this research, and they are integrated to formulate the model. The proposed usability enhancement model is the major outcome of the study. It would contribute to making unstructured data usable and facilitating data preparation activities with more valuable data, eventually improving the analytical process.
... However, none of the publications offers a comprehensive overview of all possible challenges as well as opportunities for a particular scenario in terms of volume, velocity, variety and veracity. Case studies in customer-oriented businesses [54,62], the tactical domain [57], healthcare [61,64] and transportation [55] differ in their proposals on how software solutions have to be developed to meet predictable and unpredictable future requirements, so that the analysis of the data fulfils an additional big data characteristic, value [54]. ...
Article
Full-text available
Big data attracts researchers and practitioners around the globe in their desire to effectively manage the data deluge resulting from the ongoing evolution of the information systems domain. Consequently, many decision makers attempt to harness the potential arising from the use of these modern technologies in a multitude of application scenarios. As a result, big data has gained an important role for many businesses. However, as of today, the developed solutions are oftentimes perceived as completed products, without considering that their application in highly dynamic environments might benefit from a deviation from this approach. Relevant data sources as well as the questions that are supposed to be answered by their analysis may change rapidly, and so do, subsequently, the requirements regarding the functionalities of the system. To our knowledge, while big data itself is a prominent topic, fields of application that are likely to evolve in a short period of time, and the resulting consequences, have not been specifically investigated until now. Therefore, this research aims to overcome this paucity by clarifying the relation between dynamic business environments and big data analytics (BDA), sensitizing researchers and practitioners for future big data engineering activities. Apart from a thorough literature review, expert interviews are conducted that evaluate the inferences made regarding dynamic and stable influencing factors, the influence of dynamic environments on BDA applications, as well as possible countermeasures. The ascertained insights are condensed into a proposal for decision making, facilitating the alignment of BDA and business needs in dynamic business environments.
... Social media and online news agencies play the main roles in producing this data. However, the vast majority of it is unstructured and thus cannot be easily understood [1,2]. Also, the large volume of available data makes it difficult for people to process it in a timely manner. ...
Preprint
Event extraction (EE) is one of the core information extraction tasks, whose purpose is to automatically identify and extract information about incidents and their actors from texts. This may be beneficial to several domains such as knowledge bases, question answering, information retrieval and summarization tasks, to name a few. The problem of extracting event information from texts is longstanding and usually relies on elaborately designed lexical and syntactic features, which, however, take a large amount of human effort and lack generalization. More recently, deep neural network approaches have been adopted as a means to learn underlying features automatically. However, existing networks do not make full use of syntactic features, which play a fundamental role in capturing very long-range dependencies. Also, most approaches extract each argument of an event separately without considering associations between arguments which ultimately leads to low efficiency, especially in sentences with multiple events. To address the two above-referred problems, we propose a novel joint event extraction framework that aims to extract multiple event triggers and arguments simultaneously by introducing shortest dependency path (SDP) in the dependency graph. We do this by eliminating irrelevant words in the sentence, thus capturing long-range dependencies. Also, an attention-based graph convolutional network is proposed, to carry syntactically related information along the shortest paths between argument candidates that captures and aggregates the latent associations between arguments; a problem that has been overlooked by most of the literature. Our results show a substantial improvement over state-of-the-art methods.
Chapter
This paper delves into the profound impact of data integration on talent management within the evolving landscape of technological advancements, particularly focusing on sustainable practices in human resources. Traditionally rooted in societal norms and organizational culture, talent management has developed from conventional approaches to leveraging a variety of data sources, a transformation that enriches sustainable HR strategies. Through a narrative literature review, the study maps the development of HR data sources, reinforcing the interplay between structured and unstructured data. It examines the digitalization of HR, not only tracing the progression of Human Resource Information Systems (HRIS) but also highlighting their role in fostering sustainable workforce management. The incorporation of advanced technologies such as machine learning and natural language processing is scrutinized, assessing their effects on the efficiency and environmental considerations of HR processes.
Chapter
Gathering significant amounts of information from different sources and formats is a major real-world challenge. Traditional databases can store only small amounts of data, and older database management systems struggle to extract knowledge from unorganized data. It therefore becomes essential to manage both organized and unorganized data when creating an effective system. Big data technology, which can derive insights from both structured and unstructured data, solves this problem. The goal of big data is to compile information from multiple places and store it in a central location. There are many opportunities to provide high quality patient care at reasonable cost when big data analytics is used in the healthcare industry. Additionally, the proliferation of mobile and internet-connected devices has sparked an information explosion. Large amounts of data cannot be handled by current frameworks; new, creative approaches must be utilized. This chapter outlines data analytics for public health and its role in health systems overall.
Chapter
Pressures to enhance healthcare sector institutions' effectiveness have increased in recent years. Accordingly, healthcare institutions have started to employ big data technologies to achieve low-cost optimization of higher-quality products and services. Further, advanced IT systems play a critical role by serving as a centralized system for managing the healthcare industry's big data in the Sultanate of Oman. This study explores the impact of big data awareness on healthcare institutions' performance in Oman. A questionnaire was distributed to employees working in healthcare institutions in Oman; the collected data were analyzed using WarpPLS software as an application of structural equation modeling to test the proposed theoretical model. The final sample size included 148 participants. The results indicated that the knowledge of big data's features and recognition of big data's challenges were significant predictors of health institutions' performance (HCIP). In contrast, insights into big data applications and familiarity with the concept of big data did not show a significant impact on HCIP. Findings may help policymakers and healthcare sector executives learn the significance of different big data awareness dimensions in boosting Omani healthcare institutions' effectiveness.
Article
Full-text available
The task of insights extraction from unstructured text poses significant challenges for big data analytics because it contains subjective intentions, different contextual perspectives, and information about the surrounding real world. The technical and conceptual complexities of unstructured text degrade its usability for analytics. Unlike structured data, the existing literature lacks solutions to address the usability of unstructured text big data. A usability enhancement model has been developed to address this research gap, incorporating various usability dimensions, determinants, and rules as key components. This paper adopted the Delphi technique to validate the usability enhancement model and ensure its correctness, confidentiality, and reliability. The primary goal of model validation is to assess the external validity and suitability of the model through domain experts and professionals. Therefore, subject matter experts from industry and academia in different countries were invited to this Delphi study, which provides more reliable and extensive opinions. A multistep, iterative Knowledge Resource Nomination Worksheet (KRNW) process was adopted for expert identification and selection. The Average Percent of Majority Opinions (APMO) method was used to produce the cut-off rate that determines consensus achievement. Consensus was not achieved after the first round of the Delphi, where the APMO cut-off rate was 70.9%. The model was improved based on the opinions of 10 subject matter experts. After the second round, the analysis showed majority agreement on the revised model and consensus on all improvements, validating the improved usability enhancement model. The final proposed model provides a systematic and structured approach to enhance the usability of unstructured text big data. The outcome of the research is significant for researchers and data analysts.
Chapter
Full-text available
Artificial Intelligence (AI) has played a significant role in improving decision-making within the healthcare system. AI includes machine learning, which encompasses a subset called artificial neural networks (ANNs). These networks mimic how biological neurons in the brain signal one another. In this chapter, we conduct a seminal review of ANNs and explain how prediction and classification tasks can be conducted in the field of medicine. Basic information is provided showing how neural networks solve the problem of determining disease subsets by analyzing huge amounts of structured and unstructured patient data. We also provide information on the application of conventional ANNs and deep convolutional neural networks (DCNNs) that are specific to medical image processing. For example, DCNNs can be used to detect the edges of an item within an image. The acquired knowledge can then be transferred so that similar edges can be identified on another image. This chapter is unique; it is specifically aimed at medical professionals who are interested in artificial intelligence. Because we demonstrate the applications in a straightforward manner, researchers from other technical fields will also benefit.
Chapter
Natural language processing (NLP) is the subfield of artificial intelligence that has the potential to make human language analyzable by computers. NLP is increasingly proving its importance in the medical field, where a huge amount of data remains unstructured (free text) stored as electronic medical records (EMR): discharge summaries, lab reports, clinical notes, pathology reports, etc. Traditional machine learning (ML) based approaches have been widely used for medical NLP tasks, but these methods require substantial manual work and still suffer in terms of accuracy. However, deep learning (DL) based methods have brought significant improvements. The main goal of this study is to present the state-of-the-art DL-based NLP techniques in healthcare. We start by presenting word embedding techniques and popular deep learning models used in this area, and then review applications of NLP tasks in the medical domain such as classification, prediction, and information extraction. We conclude our study by analyzing the cited architectures and showing the promising results of CNN, BiLSTM and BERT fine-tuning. Keywords: Natural language processing, Deep learning, Word embedding, Neural networks, Medical text, CNN, BiLSTM, BERT
Article
Full-text available
Electronic medical records (EMRs) help in identifying disease archetypes and progression. A very important part of EMRs is the presence of time domain data because these help with identifying trends and monitoring changes through time. Most time-series data come from wearable devices monitoring real-time health trends. This review focuses on the time-series data needed to construct complete EMRs by identifying paradigms that fall within the scope of the application of artificial intelligence (AI) based on the principles of translational medicine. (1) Background: The question addressed in this study is: What are the taxonomies present in the field of the application of machine learning on EMRs? (2) Methods: Scopus, Web of Science, and PubMed were searched for relevant records. The records were then filtered based on a PRISMA review process. The taxonomies were then identified after reviewing the selected documents; (3) Results: A total of five main topics were identified, and the subheadings are discussed in this review; (4) Conclusions: Each aspect of the medical data pipeline needs constant collaboration and update for the proposed solutions to be useful and adaptable in real-world scenarios.
Article
Full-text available
The main purpose of this study is to examine articles on innovation in the field of healthcare services using science mapping techniques. Using the R-based Bibliometrix and VOSviewer software, the most influential authors, countries, institutions and journals were identified. Web of Science articles published between 1975 and 2019 were downloaded from the Core Collection with a search strategy and analyzed with Bibliometrix and VOSviewer. The main reason for not reviewing with only one software package is the comparative advantage each package offers that is not available in the others. The analysis shows that the USA, Canada, England, the Netherlands and France are among the countries that publish the most articles on innovation, and that Health Policy is the journal that publishes the most articles on this subject. In addition, Weiner was identified as the author with the highest h- and g-indexes. Developments in health innovation, which have emerged through technological advances, are successfully increasing patient life expectancy and quality of life. Because the care, treatment and diagnosis provided in healthcare services increase efficiency in many areas, they help reduce costs and minimize human error. With this study, clinicians, healthcare providers and suppliers, researchers, policy makers, decision makers and patients will be able to gain new opportunities in light of the new information obtained from the analysis of the emerging data. Keywords: Health Care Services, Innovation, Bibliometric Analysis.
Chapter
More than 80% of healthcare data is unstructured. The complexity of and challenges in healthcare data demand a methodical approach to digital transformation. The Process, Enablement, Tooling, and Synthesis (PETS) method is presented, which provides a holistic approach and discipline to help organizations get the digital transformation of unstructured data in the healthcare domain right the first time. PETS establishes and evolves a comprehensive knowledge base for technology facilitation and implementation in the new era. Details of the PETS modules and open-source solutions are discussed. Best practices and real-world PETS applications are articulated in context. Keywords: Digital transformation, Healthcare, Unstructured data, Process, Enablement, Tooling, Synthesis, Best practice, Application, Open source, Discipline
Conference Paper
Full-text available
Abstract—This paper presents an Information Extraction (IE) system for extracting knowledge from medical records. Information extraction is the process of extracting structured data from unstructured text sources. Medical records written in both Arabic and English were gathered from hospitals in Gaza. A model was then defined for converting unstructured medical text into structured form, and association rules were used to create useful rules from the structured data. These rules can assist medical personnel in detecting hidden relationships in medical data and making decisions that enhance patient care. Two approaches were used to assess the work: objective and subjective. For the objective assessment of the association rules, support and confidence measures were used. For the subjective evaluation, a questionnaire was used in which medical experts evaluated the produced rules. The produced rules were found to be useful by 87% of the medical experts.
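The support and confidence measures used above for objective assessment can be made concrete with a few lines of code. The following Python sketch is only an illustration under assumed toy transactions (it is not the authors' system, and the clinical terms are invented): it enumerates simple one-item rules from structured records and keeps those passing minimal support and confidence thresholds.

```python
from itertools import combinations

# Hypothetical structured records derived from medical free text.
records = [
    {"fever", "cough", "flu"},
    {"fever", "cough"},
    {"fever", "flu"},
    {"cough", "flu"},
    {"fever", "cough", "flu"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent) over the transactions."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

# Enumerate simple one-item -> one-item rules and keep the strong ones.
items = set().union(*records)
for a, c in combinations(sorted(items), 2):
    s = support({a, c}, records)
    conf = confidence({a}, {c}, records)
    if s >= 0.4 and conf >= 0.6:
        print(f"{a} -> {c}: support={s:.2f}, confidence={conf:.2f}")
```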
Article
Natural language processing (NLP) is a computerized approach to analyzing text that explores how computers can be used to understand and manipulate natural language text or speech to do useful things. In the healthcare field, NLP techniques are applied in a variety of applications, ranging from evaluating the adequacy of treatment and assessing the presence of acute illness to other forms of clinical decision support. After converting text into computer-readable data through text preprocessing, an NLP system can extract valuable information using rule-based algorithms, machine learning, and neural networks. NLP can be used to distinguish subtypes of stroke or to accurately extract critical clinical information such as stroke severity and patient prognosis. If these NLP methods are actively utilized in the future, they will make the most of electronic health records and enable optimal medical judgment.
Article
Event extraction (EE) is one of the core information extraction tasks, whose purpose is to automatically identify and extract information about incidents and their actors from texts. This may be beneficial to several domains such as knowledge base construction, question answering and summarization tasks, to name a few. The problem of extracting event information from texts is longstanding and usually relies on elaborately designed lexical and syntactic features, which, however, take a large amount of human effort and lack generalization. More recently, deep neural network approaches have been adopted as a means to learn underlying features automatically. However, existing networks do not make full use of syntactic features, which play a fundamental role in capturing very long-range dependencies. Also, most approaches extract each argument of an event separately without considering associations between arguments which ultimately leads to low efficiency, especially in sentences with multiple events. To address the above-referred problems, we propose a novel joint event extraction framework that aims to extract multiple event triggers and arguments simultaneously by introducing shortest dependency path in the dependency graph. We do this by eliminating irrelevant words in the sentence, thus capturing long-range dependencies. Also, an attention-based graph convolutional network is proposed, to carry syntactically related information along the shortest paths between argument candidates that captures and aggregates the latent associations between arguments; a problem that has been overlooked by most of the literature. Our results show a substantial improvement over state-of-the-art methods on two datasets, namely ACE 2005 and TAC KBP 2015.
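The shortest-dependency-path idea underlying this framework can be illustrated independently of the neural model. The sketch below is only a toy example, not the authors' implementation: it hard-codes a small dependency tree (in practice it would come from a parser) and uses networkx to extract the words on the shortest path between a candidate trigger and a candidate argument, pruning the irrelevant words.

```python
import networkx as nx

# Toy dependency edges for: "The company acquired the startup in 2019"
# (head, dependent) pairs; a real system would obtain these from a parser.
edges = [
    ("acquired", "company"),
    ("company", "The"),
    ("acquired", "startup"),
    ("startup", "the"),
    ("acquired", "in"),
    ("in", "2019"),
]

# The dependency tree is treated as an undirected graph when extracting the SDP.
graph = nx.Graph(edges)

trigger = "acquired"          # candidate event trigger
argument_candidate = "2019"   # candidate argument (e.g., a TIME role)

# Words on the shortest dependency path; determiners like "The" are pruned away.
sdp = nx.shortest_path(graph, source=trigger, target=argument_candidate)
print(sdp)  # ['acquired', 'in', '2019']
```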
Article
Full-text available
Background: Traditional health information systems are generally devised to support clinical data collection at the point of care. However, as the significance of the modern information economy expands in scope and permeates the healthcare domain, there is an increasing urgency for healthcare organisations to offer information systems that address the expectations of clinicians, researchers and the business intelligence community alike. Amongst other emergent requirements, the principal unmet need might be defined as the 3R principle (right data, right place, right time) to address deficiencies in organisational data flow while retaining the strict information governance policies that apply within the UK National Health Service (NHS). Here, we describe our work on creating and deploying a low cost structured and unstructured information retrieval and extraction architecture within King's College Hospital, the management of governance concerns and the associated use cases and cost saving opportunities that such components present. Results: To date, our CogStack architecture has processed over 300 million lines of clinical data, making it available for internal service improvement projects at King's College London. On generated data designed to simulate real world clinical text, our de-identification algorithm achieved up to 94% precision and up to 96% recall. Conclusion: We describe a toolkit which we feel is of huge value to the UK (and beyond) healthcare community. It is the only open source, easily deployable solution designed for the UK healthcare environment, in a landscape populated by expensive proprietary systems. Solutions such as these provide a crucial foundation for the genomic revolution in medicine.
Article
Full-text available
Healthcare quality research is a fundamental task that involves assessing treatment patterns and measuring the associated patient outcomes to identify potential areas for improving healthcare. While both qualitative and quantitative approaches are used, a major obstacle for the quantitative approach is that many useful healthcare quality indicators are buried within provider narrative notes, requiring expensive and laborious manual chart review to identify and measure them. Information extraction is a key Natural Language Processing (NLP) task for discovering and mining critical knowledge buried in unstructured clinical data. Nevertheless, widespread adoption of NLP has yet to materialize; the technical skills required for the development or use of such software present a major barrier for medical researchers wishing to employ these methods. In this paper we introduce Canary, a free and open source solution designed for users without NLP and technical expertise, and apply it to four tasks, aiming to measure the frequency of: (1) insulin decline; (2) statin medication decline; (3) adverse reactions to statins; and (4) bariatric surgery counselling. Our results demonstrate that this approach facilitates mining of unstructured data with high accuracy, enabling the extraction of actionable healthcare quality insights from free-text data sources.
Article
Full-text available
Coupled with the rise of data science and machine learning, the increasing availability of digitized health and wellness data has provided an exciting opportunity for complex analyses of problems throughout the healthcare domain. Whereas many early works focused on a particular aspect of patient care, often drawing on data from a specific clinical or administrative source, it has become clear such a single-source approach is insufficient to capture the complexity of the human condition. Instead, adequately modeling health and wellness problems requires the ability to draw upon data spanning multiple facets of an individual’s biology, their care, and the social aspects of their life. Although such an awareness has greatly expanded the breadth of health and wellness data collected, the diverse array of data sources and intended uses often leave researchers and practitioners with a scattered and fragmented view of any particular patient. As a result, there exists a clear need to catalogue and organize the range of healthcare data available for analysis. This work represents an effort at developing such an organization, presenting a patient-centric framework deemed the Healthcare Data Spectrum (HDS). Comprised of six layers, the HDS begins with the innermost micro-level omics and macro-level demographic data that directly characterize a patient, and extends at its outermost to aggregate population-level data derived from attributes of care for each individual patient. For each level of the HDS, this manuscript will examine the specific types of constituent data, provide examples of how the data aid in a broad set of research problems, and identify the primary terminology and standards used to describe the data.
Article
Full-text available
Machine learning-based patient monitoring systems are generally deployed on remote servers for analyzing heterogeneous data. While recent advances in mobile technology provide new opportunities to deploy such systems directly on mobile devices, the development and deployment challenges are not being extensively studied by the research community. In this paper, we systematically investigate challenges associated with each stage of the development and deployment of a machine learning-based patient monitoring system on a mobile device. For each class of challenges, we provide a number of recommendations that can be used by the researchers, system designers, and developers working on mobile-based predictive and monitoring systems. The results of our investigation show that when developers are dealing with mobile platforms, they must evaluate predictive systems based on both classification and computational performance. Accordingly, we propose a new machine learning training and deployment methodology specifically tailored for mobile platforms that incorporates metrics beyond traditional classifier performance.
Article
Full-text available
Background and objective In the medical field, data volume is increasingly growing, and traditional methods cannot manage it efficiently. In biomedical computation, the continuous challenges are: management, analysis, and storage of the biomedical data. Nowadays, big data technology plays a significant role in the management, organization, and analysis of data, using machine learning and artificial intelligence techniques. It also allows a quick access to data using the NoSQL database. Thus, big data technologies include new frameworks to process medical data in a manner similar to biomedical images. It becomes very important to develop methods and/or architectures based on big data technologies, for a complete processing of biomedical image data. Method This paper describes big data analytics for biomedical images, shows examples reported in the literature, briefly discusses new methods used in processing, and offers conclusions. We argue for adapting and extending related work methods in the field of big data software, using Hadoop and Spark frameworks. These provide an optimal and efficient architecture for biomedical image analysis. This paper thus gives a broad overview of big data analytics to automate biomedical image diagnosis. A workflow with optimal methods and algorithm for each step is proposed. Results Two architectures for image classification are suggested. We use the Hadoop framework to design the first, and the Spark framework for the second. The proposed Spark architecture allows us to develop appropriate and efficient methods to leverage a large number of images for classification, which can be customized with respect to each other. Conclusions The proposed architectures are more complete, easier, and are adaptable in all of the steps from conception. The obtained Spark architecture is the most complete, because it facilitates the implementation of algorithms with its embedded libraries.
Article
Full-text available
The growth of online health communities particularly those involving socially generated content can provide considerable value for society. Participants can gain knowledge of medical information or interact with peers on medical forum platforms. Analysing sentiment expressed by members of a health community in medical forum discourse can be of significant value, such as by identifying a particular aspect of an information space, determining themes that predominate among a large data set, and allowing people to summarize topics within a big data set. In this paper, we identify sentiments expressed in online medical forums that discuss Lyme disease. There are two goals in our research: first, to identify a complete and relevant set of categories that can characterize Lyme disease discourse; and second, to test and investigate strategies, both individually and collectively, for automating the classification of medical forum posts into those categories. We present a feature-based model that consists of three different feature sets: content-free, content-specific and meta-level features. Employing inductive learning algorithms to build a feature-based classification model, we assess the feasibility and accuracy of our automated classification. We further evaluate our model by assessing its ability to adapt to an online medical forum discussing Lupus disease. The experimental results demonstrate the effectiveness of our approach.
Chapter
Full-text available
The proliferation of mobile technologies has paved the way for the widespread use of mobile health (mHealth) devices. This in turn generates a large amount of data, which is essentially big data, that can be used for various purposes. In order to obtain the maximum benefit from mHealth data, emerging big data technologies can be employed. In this chapter, the relationship between mHealth and big data is investigated from a sociotechnical perspective. Following an overview of the state-of-the-art, stakeholders and their interests are identified, and the impact of big data on such interests is presented. The opportunities of using big data technologies in the mHealth domain are considered from several viewpoints. Social and economic implications of using big data technologies toward these ends are highlighted. Various challenges exist in the implementation and adoption of mHealth data processing. While there are social challenges including privacy, safety, and a false sense of confidence, there are also technical challenges such as security, standardization, correctness, timely analysis, and domain expertise. Some of these coincide with the challenges of the big data domain, and the others are related to human nature and human capabilities. The use of existing big data platforms requires significant expertise and know-how in data science domain which may hinder the adoption of big data technologies in mHealth. Hence, a solution in the form of a framework that provides higher abstraction level programming models is suggested to facilitate widespread user adoption. Accordingly, user aspects associated with big data in the mHealth domain are discussed.
Article
Full-text available
With the development of big data computing technology, most documents in various areas, including politics, economics, society, culture, life, and public health, have been digitalized. The structure of conventional documents differs according to their authors or the organizations that generated them. Therefore, policies and studies related to their efficient digitalization and use exist. Text mining is the technology used to classify, cluster, extract, search, and analyze data to find patterns or features in a set of unstructured or structured documents written in natural language. In this paper, a method for extracting associative feature information from health big data using text mining is proposed. Using health documents as raw data, health big data are created from the Web. The useful information contained in health documents is extracted through text mining. The raw health documents are collected through Web scraping and then saved on a file server. The collected raw documents consist of sentences, and thus morphological analysis is applied to create a corpus. The file server executes stop word removal, tagging, and the analysis of polysemous words in a preprocessing procedure to create a candidate corpus. TF-C-IDF is applied to the candidate corpus to evaluate the importance of words in the set of documents. The words classified as highly important by TF-C-IDF are included in a set of keywords, and the transactions of each document are created. Using an Apriori mining algorithm, the association rules of keywords in the created transactions are analyzed and associative keywords are generated. TF-C-IDF weights and associative keywords are extracted from health big data as associative features. The proposed method is a base technology for creating added value in the healthcare industry in the era of the 4th industrial revolution. Its evaluation in terms of F-measure and efficiency showed its performance to be high. The method is expected to contribute to healthcare big data management and information search.
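The keyword-weighting and transaction-building steps described above can be sketched with standard tooling. The snippet below is an approximation only: it uses plain TF-IDF from scikit-learn in place of the paper's TF-C-IDF variant, and the documents are invented; the resulting per-document keyword transactions are what an Apriori miner would then consume.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical preprocessed health documents (stop words already removed).
docs = [
    "diabetes insulin blood sugar diet exercise",
    "hypertension blood pressure salt exercise",
    "diabetes diet exercise weight loss",
]

vectorizer = TfidfVectorizer()
weights = vectorizer.fit_transform(docs)
terms = vectorizer.get_feature_names_out()

# Keep the top-3 weighted terms per document as that document's "transaction".
transactions = []
for row in weights.toarray():
    top = sorted(zip(terms, row), key=lambda pair: pair[1], reverse=True)[:3]
    transactions.append([term for term, weight in top if weight > 0])

print(transactions)
# These transactions would then be passed to an Apriori implementation
# (for example, mlxtend.frequent_patterns.apriori) to derive associative keywords.
```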
Article
Full-text available
Recent snapshots of the European progress on big data in health care and precision medicine reveal diverse perceptions of experts and the public, leading to the impression that algorithmic issues have the largest share among the challenges all health systems are faced with. Yet, from a comparison of different countries it is evident that the adaption and integration of heterogeneous data sources have a major impact on the advancement of precision medicine. Legal regulations for implementation and operation of healthcare networking are actively discussed in the public and gradually implemented in several countries. Based on a unified documentation, they are a perfect precondition for integrating distributed healthcare data to a big data platform with a reliable fact representation. Now, basic and clinical scientists have to be motivated to share their work with these data platforms. In this work, we aim to provide an overview on the common issues in big healthcare data applications and address the challenges for the involved scientific, clinical and administrative partners. We propose a possible strategy for a comprehensive data integration by iterating data harmonization, semantic enrichment and data analysis processes.
Article
Full-text available
Data seldom create value by themselves. They need to be linked and combined from multiple sources, which can often come with variable data quality. The task of improving data quality is a recurring challenge. In this paper, we use a case study of a large telecom company to develop a generic process pattern model for improving data quality. The process pattern model is defined as a proven series of activities, aimed at improving the data quality given a certain context, a particular objective, and a specific set of initial conditions. Four different patterns are derived to deal with the variations in data quality of datasets. Instead of having to find the way to improve the quality of big data for each situation, the process model provides data users with generic patterns, which can be used as a reference model to improve big data quality.
Article
Full-text available
This paper presents the application of text mining methods to the texts in electronic health records (EHR). An experimental study shows how to improve the data's capacity to reflect real medical processes for process modeling tasks. The method is based on patterns identified while analyzing medical databases together with a physician assistant. EHRs are characterized by a gap between their common semantic structure and their syntactic structure, which matters for modeling complex processes. This study aimed to solve the problem of knowledge retrieval from EHRs by identifying the specifics of their semantic structure and developing algorithms for interpreting medical records using text mining. Medical test descriptions, surgery protocols, and other medical documents contain many items that are extremely important for process analysis. Automated retrieval of significant data from EHRs can also be used to populate knowledge bases. Moreover, the proposed method was developed on actual Russian-language medical data of acute coronary syndrome (ACS) patients from a specialized medical center, which is also valuable. The efficiency of this method is demonstrated through a correlation analysis of the effect of comorbidities on ACS treatment duration, and by using the extracted data to develop process models with complexity metrics at the control-flow perspective of process mining techniques.
Article
Full-text available
Context: One of the main targets of cyber-attacks is data exfiltration, which is the leakage of sensitive or private data to an unauthorized entity. Data exfiltration can be perpetrated by an outsider or an insider of an organization. Given the increasing number of data exfiltration incidents, a large number of data exfiltration countermeasures have been developed. These countermeasures aim to detect, prevent, or investigate exfiltration of sensitive or private data. With the growing interest in data exfiltration, it is important to review data exfiltration attack vectors and countermeasures to support future research in this field. Objective: This paper is aimed at identifying and critically analysing data exfiltration attack vectors and countermeasures, reporting the state of the art, and determining gaps for future research. Method: We have followed a structured process for selecting 108 papers from seven publication databases. The thematic analysis method has been applied to analyse the data extracted from the reviewed papers. Results: We have developed a classification of (1) data exfiltration attack vectors used by external attackers and (2) the countermeasures in the face of external attacks. We have mapped the countermeasures to attack vectors. Furthermore, we have explored the applicability of various countermeasures for different states of data (i.e., in use, in transit, or at rest). Conclusion: This review has revealed that (a) most of the state of the art is focussed on preventive and detective countermeasures, and significant research is required on developing investigative countermeasures that are equally important; (b) several data exfiltration countermeasures are not able to respond in real time, which indicates that research efforts need to be invested to enable them to respond in real time; (c) a number of data exfiltration countermeasures do not take privacy and ethical concerns into consideration, which may become an obstacle to their full adoption; (d) existing research is primarily focussed on protecting data in the 'in use' state; therefore, future research needs to be directed towards securing data in the 'at rest' and 'in transit' states; (e) there is no standard or framework for the evaluation of data exfiltration countermeasures. We assert the need for developing such an evaluation framework.
Article
Full-text available
Background: Geriatric syndromes, including frailty, are common in older adults and associated with adverse outcomes. We compared patients described in clinical notes as "frail" to other older adults with respect to geriatric syndrome burden and healthcare utilization. Methods: We conducted a retrospective cohort study on 18,341 Medicare Advantage enrollees aged 65+ (members of a large nonprofit medical group in Massachusetts), analyzing up to three years of administrative claims and structured and unstructured electronic health record (EHR) data. We determined the presence of ten geriatric syndromes (falls, malnutrition, dementia, severe urinary control issues, absence of fecal control, visual impairment, walking difficulty, pressure ulcers, lack of social support, and weight loss) from claims and EHR data, and the presence of frailty descriptions in clinical notes with a pattern-matching natural language processing (NLP) algorithm. Results: Of the 18,341 patients, we found that 2202 (12%) were described as "frail" in clinical notes. "Frail" patients were older (82.3 ± 6.8 vs 75.9 ± 5.9, p < .001) and had higher rates of healthcare utilization, including number of inpatient hospitalizations and emergency department visits, than the rest of the population (p < .001). "Frail" patients had on average 4.85 ± 1.72 of the ten geriatric syndromes studied, while non-frail patients had 2.35 ± 1.71 (p = .013). Falls, walking difficulty, malnutrition, weight loss, lack of social support and dementia were more highly correlated with frailty descriptions. The most common geriatric syndrome pattern among "frail" patients was a combination of walking difficulty, lack of social support, falls, and weight loss. Conclusions: Patients identified as "frail" by providers in clinical notes have higher rates of healthcare utilization and more geriatric syndromes than other patients. Certain geriatric syndromes were more highly correlated with descriptions of frailty than others.
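The pattern-matching step used to find frailty descriptions can be illustrated with a minimal sketch. The example below is not the study's NLP algorithm; the note snippets and the single regular expression are assumptions that only show how descriptions of frailty might be flagged in free text.

```python
import re

# Hypothetical clinical-note snippets.
notes = [
    "Patient is an 84-year-old frail woman with recent falls.",
    "Elderly gentleman, frailty noted on exam, requires walker.",
    "Well-appearing 67-year-old, independent in all ADLs.",
]

# A simple pattern for frailty descriptions; a production system would be far richer
# and would also handle negation and templated text.
frailty_pattern = re.compile(r"\bfrail(?:ty)?\b", flags=re.IGNORECASE)

flagged = [bool(frailty_pattern.search(note)) for note in notes]
print(flagged)  # [True, True, False]
```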
Conference Paper
Full-text available
Readmission rate is a quality metric for hospitals. The electronic medical record is the main source for identifying readmitted patients and calculating readmission rates. Difficulties remain in identifying patients readmitted to a facility different from the one that performed the procedure. In this study, we assessed the impact of using unstructured data to detect readmission within 30 days of surgery. We implemented two rule-based systems to recognize any mention of readmission in follow-up phone call conversations. We evaluated our systems on datasets from two hospitals. Our evaluation showed that using unstructured data, in addition to structured data, increased sensitivity in both datasets, from 53 to 81 percent and from 66 to 87 percent.
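As a rough sketch of how such a rule-based system could combine structured flags with mentions found in call notes (the phrases, flags, and gold labels below are invented, not the study's rules), a readmission pattern is applied to the notes and sensitivity is computed against a gold standard.

```python
import re

# Hypothetical cases: (structured readmission flag, call-note text, gold-standard label).
cases = [
    (False, "Patient reports she was readmitted to an outside hospital.", True),
    (True,  "No issues reported on follow-up call.",                      True),
    (False, "Recovering well at home, no readmission.",                   False),
    (False, "Went back to the ER and was admitted again for infection.",  True),
]

readmit_pattern = re.compile(r"\breadmit\w*|\badmitted again\b", flags=re.IGNORECASE)

def detected(structured_flag, note):
    # Positive if either the structured flag or the note mentions readmission.
    # Note: this naive pattern also fires on the negated mention in case 3,
    # which is why real rule-based systems add negation handling.
    return structured_flag or bool(readmit_pattern.search(note))

true_positives = sum(1 for flag, note, gold in cases if gold and detected(flag, note))
all_positives = sum(1 for _, _, gold in cases if gold)
print(f"sensitivity = {true_positives / all_positives:.2f}")
```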
Article
Full-text available
Background: Korian is a private group specializing in medical accommodations for elderly and dependent people. A professional data warehouse (DWH) established in 2010 hosts all of the residents' data. Inside this information system (IS), clinical narratives (CNs) were used only by medical staff as a tool for linking residents' care. The objective of this study was to show that, through qualitative and quantitative textual analysis of a relatively small, well-defined sample of physiotherapy CNs, it was possible to build a physiotherapy corpus and, through this process, generate a new body of knowledge by adding relevant information describing the residents' care and lives. Methods: Meaningful words were extracted through Structured Query Language (SQL) with the LIKE function and wildcards to perform pattern matching, followed by text mining and a word cloud using R® packages. Another step involved principal components and multiple correspondence analyses, plus clustering, on the same residents' sample as well as on other health data using a health model measuring the residents' care-level needs. Results: By combining these techniques, physiotherapy treatments could be characterized by a list of constructed keywords, and the residents' health characteristics were built. Feeding defects or health outlier groups could be detected, physiotherapy residents' data were matched with their health data, and differences in health situations showed qualitative and quantitative differences in the physiotherapy narratives. Conclusions: This two-stage textual experiment showed that text mining and data mining techniques provide convenient tools to improve residents' health and quality of care by adding new, simple, usable data to the electronic health record (EHR). When used with a normalized physiotherapy problem list, text mining through information extraction (IE), named entity recognition (NER) and data mining (DM) can provide a real advantage in describing health care, adding new medical material and helping to integrate the EHR system into the health staff's work environment.
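The SQL pattern-matching step described in the Methods can be illustrated in a few lines. The sketch below uses an in-memory SQLite table with invented narratives as a stand-in for the production DWH; the table, column names and pattern are assumptions, not the group's actual schema.

```python
import sqlite3

# In-memory toy table standing in for the clinical-narrative table of the DWH.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (resident_id INTEGER, narrative TEXT)")
conn.executemany(
    "INSERT INTO notes VALUES (?, ?)",
    [
        (1, "Physiotherapy session: gait training and balance exercises."),
        (2, "Resident refused physio today, complained of knee pain."),
        (3, "Routine nursing note, no therapy performed."),
    ],
)

# LIKE with wildcards performs the pattern matching over the free-text column.
rows = conn.execute(
    "SELECT resident_id, narrative FROM notes WHERE narrative LIKE '%physio%'"
).fetchall()
print(rows)  # rows 1 and 2 match; the extracted words feed the text-mining step
```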
Article
Full-text available
The aim of this preliminary study is integrated research on text-based data mining and a toxicity prediction modeling system for a big data based clinical decision support system in radiation oncology. Structured data were prepared from treatment plans, and unstructured dose-volume data for prostate cancer were extracted by image pattern recognition from research articles crawled from the internet. We modeled an artificial neural network to build a predictor system for the toxicity of organs at risk. We used a text-based data mining approach to build the artificial neural network model for bladder and rectum complication predictions. The pattern recognition method was used to mine the unstructured dose-volume toxicity data with a detection accuracy of 97.9%. The confusion matrix and training of the neural network were obtained with 50 modeled plans (n = 50) for validation. The toxicity level was analyzed, and the risk factors for 25% bladder, 50% bladder, 20% rectum, and 50% rectum were calculated by the artificial neural network algorithm. As a result, among the 50 modeled plans, 32 could cause complications while 18 were designed as non-complication plans. We integrated data mining and a toxicity modeling method for toxicity prediction using prostate cancer cases. It is shown that a preprocessing analysis using text-based data mining and prediction modeling can be extended to personalized patient treatment decision support based on big data.
Article
Full-text available
Coronary Artery Disease (CAD) is not only the most common form of heart disease, but also the leading cause of death in both men and women [1]. We present a system that is able to automatically predict whether patients develop coronary artery disease based on their narrative medical histories, i.e., clinical free text. Although the free text in medical records has been used in several studies for identifying risk factors of coronary artery disease, to the best of our knowledge our work marks the first attempt at automatically predicting development of CAD. We tackle this task on a small corpus of diabetic patients. The size of this corpus makes it important to limit the number of features in order to avoid overfitting. We propose an ontology-guided approach to feature extraction and compare it with two classic feature selection techniques. Our system achieves state-of-the-art performance with a 77.4% F1 score.
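The classic feature selection baselines mentioned here can be illustrated with a standard filter method. The snippet below is a generic sketch, not the paper's ontology-guided extraction: synthetic data stands in for bag-of-words features from clinical narratives, and a chi-square filter keeps a small feature subset to limit overfitting on a small corpus.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for sparse text features extracted from clinical narratives.
X, y = make_classification(n_samples=200, n_features=500, n_informative=20, random_state=0)
X = MinMaxScaler().fit_transform(X)  # chi2 requires non-negative features

# Keep only the 50 features most associated with the label, limiting
# model complexity on a small corpus to reduce overfitting.
selector = SelectKBest(score_func=chi2, k=50)
X_reduced = selector.fit_transform(X, y)
print(X_reduced.shape)  # (200, 50)
```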
Article
Full-text available
Precision medicine requires clinical trials that are able to efficiently enroll subtypes of patients in whom targeted therapies can be tested. To reduce the large amount of time spent screening, identifying, and recruiting patients with specific subtypes of heterogeneous clinical syndromes (such as heart failure with preserved ejection fraction [HFpEF]), we need prescreening systems that are able to automate data extraction and decision-making tasks. However, a major obstacle is the vast amount of unstructured free-form text in medical records. Here we describe an information extraction-based approach that automatically converts unstructured text into structured data, which is cross-referenced against eligibility criteria using a rule-based system to determine which patients qualify for a major HFpEF clinical trial (PARAGON). We show that we can achieve a sensitivity and positive predictive value of 0.95 and 0.86, respectively. Our open-source algorithm could be used to efficiently identify and subphenotype patients with HFpEF and other disorders.
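The rule-based cross-referencing of extracted fields against eligibility criteria can be sketched briefly. The record and criteria below are hypothetical and serve only to illustrate the final decision step; they are not the PARAGON criteria or the authors' open-source algorithm.

```python
# Hypothetical structured record produced by the information-extraction step.
patient = {"age": 72, "ejection_fraction": 52, "on_dialysis": False, "nyha_class": 2}

# Hypothetical eligibility rules expressed as predicates over the extracted fields.
criteria = {
    "age >= 50": lambda p: p["age"] >= 50,
    "EF >= 45%": lambda p: p["ejection_fraction"] >= 45,
    "not on dialysis": lambda p: not p["on_dialysis"],
    "NYHA class II-IV": lambda p: p["nyha_class"] >= 2,
}

failed = [name for name, rule in criteria.items() if not rule(patient)]
print("eligible" if not failed else f"ineligible, failed: {failed}")
```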
Article
Full-text available
A big data analytics enabled transformation model, based on the practice-based view, is developed that reveals the causal relationships among big data analytics capabilities, IT-enabled transformation practices, benefit dimensions, and business value. This model was then tested in a healthcare setting. Through analyzing big data implementation cases, we sought to understand how big data analytics capabilities transform organizational practices, thereby generating potential benefits. In addition to conceptually defining four big data analytics capabilities, the model offers a strategic view of big data analytics. Three significant path-to-value chains were identified for healthcare organizations by applying the model, which provides practical insights for managers.
Article
Full-text available
The concept of big data is now treated from different points of view, covering its implications in many fields, notably including healthcare. To realize the wealth of health information, integrating, sharing, and making data available are essential tasks that ultimately demand a distributed system. However, the privacy and security of data are matters of concern, as data need to be accessed from various locations in the distributed system. The present study first provides a broad overview of big data and the effectiveness of healthcare big data for non-expert readers. The article then builds a distributed framework for an organized healthcare model for the purpose of protecting patient data.
Article
Full-text available
Technological advances in information-communication technologies in the health ecosystem have allowed for the recording and consumption of massive amounts of structured and unstructured health data. In developing countries, the use of Electronic Medical Records (EMR) is necessary to address the need for efficient delivery of services and informed decision-making, especially at the local level where health facilities and practitioners may be lacking. Text mining is a variation of data mining that tries to extract non-trivial information and knowledge from unstructured text. This study aims to determine the feasibility of integrating an intelligent agent within EMRs for automatic diagnosis prediction based on unstructured clinical notes. A Multilayer Feed-Forward Neural Network with Back Propagation training was implemented for classification. The two neural network models predicted hypertension against similar diagnoses with error rates of 11.52% and 10.53%, but with error rates of 54.01% and 64.82% when used on a group of similar diagnoses. Further development is needed for prediction of diagnoses with common symptoms and related diagnoses. The results nevertheless show that unstructured data possess value beneficial for clinical decision support. If further analyzed together with structured data, a more accurate intelligent agent may be explored.
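A feed-forward network over free-text notes can be assembled from standard components, as in the hedged sketch below: the notes and labels are invented, and scikit-learn's MLPClassifier stands in for the study's own backpropagation-trained network, but the overall TF-IDF-plus-classifier shape of the pipeline is the same idea.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

# Toy clinical notes and diagnoses standing in for real EMR free text.
notes = [
    "elevated blood pressure headache dizziness",
    "bp 160/100 started on amlodipine",
    "productive cough fever chest congestion",
    "fever cough shortness of breath",
]
labels = ["hypertension", "hypertension", "pneumonia", "pneumonia"]

# TF-IDF features feed a small feed-forward network trained with backpropagation.
model = make_pipeline(
    TfidfVectorizer(),
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
)
model.fit(notes, labels)
print(model.predict(["patient with high blood pressure and headache"]))
```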
Article
Full-text available
Distinguishing migraine from stroke is a challenge due to many common signs and symptoms. It is important to consider the cost of hospitalization and the time spent by neurologists and stroke nurses to visit, diagnose, and assign appropriate care to patients; therefore, devising new ways to distinguish stroke, migraine and other types of mimics can help save time and cost and improve decision-making. In this study, we utilized text and data mining methods to extract the most important predictors from clinical reports in order to establish a migraine detection model and distinguish migraine patients from stroke or other types of mimic (non-stroke) cases. The available data for this study was a heterogeneous mix of free-text fields, such as triage main complaints and specialist final impressions, as well as numeric data about patients, such as age, blood pressure, and so on. After a careful combination of these sources, we obtained a highly imbalanced dataset where the migraine cases were only about 6% of the dataset. Our main challenge was tackling this data imbalance. Using the dataset in its original form to build classifiers led to a learning bias towards the majority class and against the minority (migraine) class. We used a sampling method to address the imbalance problem. First, different sources of data were preprocessed and balanced datasets were generated; second, attribute selection algorithms were used to reduce the dimensionality of the data; third, a novel combination of data mining algorithms was employed in order to effectively distinguish migraine from other cases. We achieved a sensitivity and specificity of about 80% and 75%, respectively, in contrast to a sensitivity and specificity of 15.7% and 97% when the original imbalanced data were used to build classifiers.
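The sampling step used to counter the roughly 6% class imbalance can be sketched generically. The example below randomly undersamples the majority class on synthetic data (it is not the study's pipeline, and the classifier choice is arbitrary) and then reports sensitivity and specificity, the same metrics quoted above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic imbalanced dataset: roughly 6% positives, mimicking the migraine class.
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.94, 0.06], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Randomly undersample the majority class so both classes have equal size in training.
rng = np.random.default_rng(0)
positives = np.where(y_train == 1)[0]
negatives = rng.choice(np.where(y_train == 0)[0], size=len(positives), replace=False)
balanced_idx = np.concatenate([positives, negatives])

clf = RandomForestClassifier(random_state=0).fit(X_train[balanced_idx], y_train[balanced_idx])
tn, fp, fn, tp = confusion_matrix(y_test, clf.predict(X_test)).ravel()
print(f"sensitivity={tp / (tp + fn):.2f}, specificity={tn / (tn + fp):.2f}")
```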
Article
Full-text available
Background: Community-associated methicillin-resistant Staphylococcus aureus (CA-MRSA) is one of the most common causes of skin and soft tissue infections in the United States, and a variety of genetic host factors are suspected to be risk factors for recurrent infection. Based on the CDC definition, we have developed and validated an electronic health record (EHR) based CA-MRSA phenotype algorithm utilizing both structured and unstructured data. Methods: The algorithm was validated at three eMERGE consortium sites, and positive predictive value, negative predictive value and sensitivity were calculated. The algorithm was then run and data collected across seven total sites. The resulting data were used in GWAS analysis. Results: Across seven sites, the CA-MRSA phenotype algorithm identified a total of 349 cases and 7761 controls among the genotyped European and African American biobank populations. PPV ranged from 68 to 100% for cases and 96 to 100% for controls; sensitivity ranged from 94 to 100% for cases and 75 to 100% for controls. The frequency of cases in the populations varied widely by site. There were no plausible GWAS-significant (p < 5E-8) findings. Conclusions: Differences in EHR data representation and screening patterns across sites may have affected the identification of cases and controls and accounted for the varying frequencies across sites. Future work identifying these patterns is necessary.
Chapter
Full-text available
With the increasing use of technologically advanced equipment in the medical, biomedical and healthcare fields, the collection of patients' data from various hospitals is becoming necessary. Centralized availability of these data is useful for pharmaceutical feedback, equipment reporting, disease analysis and results, and many other purposes. Collected data can also be used for manipulating or predicting an upcoming health crisis caused by a disaster, virus outbreak or climatic change. However, collecting data from various health-related entities or from individual patients raises serious questions about data leakage, integrity, security and privacy. In this chapter the term Big Data and its usage in healthcare applications are discussed. These questions and issues are highlighted and discussed in the last section of the chapter to emphasize the broad, pre-deployment concerns. Available platforms and solutions for the usage and deployment of Big Data in healthcare-related fields and applications are also described in detail, including data privacy and security measures, user access mechanisms, authentication procedures and privileges.
Chapter
Full-text available
The key question of precision medicine is whether it is possible to find clinically actionable granularity in diagnosing disease and classifying patient risk. The advent of next-generation sequencing and the widespread adoption of electronic health records (EHRs) have provided clinicians and researchers a wealth of data and made possible the precise characterization of individual patient genotypes and phenotypes. Unstructured text—found in biomedical publications and clinical notes—is an important component of genotype and phenotype knowledge. Publications in the biomedical literature provide essential information for interpreting genetic data. Likewise, clinical notes contain the richest source of phenotype information in EHRs. Text mining can render these texts computationally accessible and support information extraction and hypothesis generation. This chapter reviews the mechanics of text mining in precision medicine and discusses several specific use cases, including database curation for personalized cancer medicine, patient outcome prediction from EHR-derived cohorts, and pharmacogenomic research. Taken as a whole, these use cases demonstrate how text mining enables effective utilization of existing knowledge sources and thus promotes increased value for patients and healthcare systems. Text mining is an indispensable tool for translating genotype-phenotype data into effective clinical care that will undoubtedly play an important role in the eventual realization of precision medicine.
Article
Full-text available
This article presents the results of a study aimed at developing an approach to the design of the information infrastructure of medical institutions that use knowledge-based clinical decision support systems (CDSS). As a source of knowledge, we mainly consider the data stored in a medical information system (MIS). The authors attempted to formulate an approach flexible enough to allow engineers to realize almost any decision-support scenario. To illustrate its practical use, we describe its application to one of the problems now being actively solved in the course of cooperation between ITMO University and the Federal Almazov North-West Medical Research Centre, namely the development of a CDSS for the diagnostics of pulmonary arterial hypertension.
Chapter
Addressing data quality (DQ) issues in Electronic Health Records (EHRs) is a noticeable trend aimed at introducing an adaptive framework for interoperability and standards in large-scale health Database Management Systems (DBMS). In addition, EHR technology provides portfolio management systems that allow Health Care Organisations (HCOs) to deliver higher quality of care to their patients than is possible with paper-based records. EHRs are in high demand as HCOs must handle ever larger datasets in their daily services. An efficient EHR system reduces data redundancy as well as system and application failures, and increases the possibility of drawing all necessary reports. Improving DQ to achieve benefits through EHRs is neither low-cost nor easy. Moreover, different HCOs follow several standards and run different major systems, which has emerged as a critical issue and practical challenge. One of the main challenges in EHRs is the inherent difficulty of coherently managing incompatible and sometimes inconsistent data structures from diverse heterogeneous sources. Interventions to overcome these barriers and challenges, as they pertain to DQ in EHRs, will therefore combine features to search, extract, filter, clean and integrate data so that users can coherently create new, consistent data sets.
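The search/filter/clean/integrate workflow mentioned above can be illustrated with a short pandas sketch over two toy record sets; the column names and values are assumptions, not a real EHR schema.

import pandas as pd

# Two toy record sets standing in for heterogeneous EHR sources
site_a = pd.DataFrame({"patient_id": [1, 1, 2, 3],
                       "sbp": [140, 140, None, 128]})
site_b = pd.DataFrame({"patient_id": [2, 3],
                       "dob": ["1980-02-01", "1975-11-30"]})

clean = (site_a.drop_duplicates()          # remove redundant records
               .dropna(subset=["sbp"]))    # filter out incomplete vitals
integrated = clean.merge(site_b, on="patient_id", how="left")  # integrate sources
print(integrated)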
Article
Technology has been used in the health domain for various purposes, such as storing electronic health records, monitoring, education, communication, and behavioural tracking. The evident benefits have triggered a huge amount of discussion about health technology in the web 3.0 space, and users around the globe share their experiences and perspectives on social media platforms. Social media has been used for creating awareness, sharing information and providing emotional support to the public across different diseases. This study focuses on exploring health technology related discussions on Twitter. Around 105,489 tweets posted by 15,587 unique users were collected and analysed through social media analytics approaches (the CUP framework). The study presents the top technologies in the health domain through hashtag analysis, the top diseases (acute, chronic, communicable and non-communicable) through word analysis, and their association through co-occurrence of words within the tweets. The associations show that technology has been used in treating, identifying and healing various diseases. The discussion on social media is skewed towards computing algorithms. Both acute and chronic diseases are discussed on social media, and our analysis indicates no statistically significant difference between the two; likewise, communicable and non-communicable diseases show no statistically significant difference in discussion volume, which signifies that users turn to Twitter to discuss all of these disease types. Future researchers can use the study as evidence for extracting socio-technical insights from Twitter data. The literature contains much evidence that technology has been useful in the health domain, but the bigger picture of how the various technologies relate to the health domain is missing; this study therefore tries to contribute to this area by mining tweets.
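A minimal Python sketch of the hashtag and word co-occurrence analysis might look as follows; the example tweets are invented and the code is not the CUP framework implementation.

import re
from collections import Counter
from itertools import combinations

# Invented example tweets about health technology
tweets = [
    "#AI helps detect diabetes early",
    "Wearables and #IoT for monitoring hypertension",
    "#AI and #IoT support cancer screening programmes",
]

# Hashtag analysis: count technology hashtags across tweets
hashtags = Counter(tag.lower() for t in tweets for tag in re.findall(r"#\w+", t))

# Word co-occurrence: count word pairs appearing in the same tweet
cooccurrence = Counter()
for t in tweets:
    words = sorted(set(re.findall(r"[a-z]+", t.lower())))
    cooccurrence.update(combinations(words, 2))

print(hashtags.most_common(3))
print(cooccurrence.most_common(3))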
Article
Standards-based modeling of electronic health record (EHR) data holds great significance for data interoperability and large-scale usage. Integration of unstructured data into a standard data model, however, poses unique challenges, partially due to the heterogeneous type systems used in existing clinical NLP systems. We introduce a scalable and standards-based framework for integrating structured and unstructured EHR data leveraging the HL7 Fast Healthcare Interoperability Resources (FHIR) specification. We implemented a clinical NLP pipeline enhanced with an FHIR-based type system and performed a case study using medication data from Mayo Clinic's EHR. Two UIMA-based NLP tools, MedXN and MedTime, were integrated into the pipeline to extract FHIR MedicationStatement resources and related attributes from unstructured medication lists. We developed a rule-based approach for assigning the NLP output types to the FHIR elements represented in the type system, whereas the FHIR elements populated from structured EMR data were investigated at the source. We used the FHIR resource "MedicationStatement" as an example to illustrate our integration framework and methods. For evaluation, we manually annotated FHIR elements in 166 medication statements from 14 clinical notes generated by Mayo Clinic in the course of patient care, and used standard performance measures (precision, recall and F-measure). The F-scores achieved ranged from 0.73 to 0.99 for the various FHIR element representations. The results demonstrate that our framework based on the FHIR type system is feasible for normalizing and integrating both structured and unstructured EHR data.
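To make the mapping step concrete, the following sketch assembles a minimal FHIR MedicationStatement resource (as a plain Python dict) from hypothetical NLP output; the element names follow the public FHIR specification, but the mapping rules and values are illustrative, not the paper's rule set.

# Hypothetical NLP output for one entry of an unstructured medication list
nlp_output = {"drug": "lisinopril", "dose": "10 mg", "frequency": "once daily"}

# Minimal FHIR MedicationStatement built from the NLP output
medication_statement = {
    "resourceType": "MedicationStatement",
    "status": "active",
    "medicationCodeableConcept": {"text": nlp_output["drug"]},
    "dosage": [{
        "text": f"{nlp_output['dose']} {nlp_output['frequency']}",
        "timing": {"code": {"text": nlp_output["frequency"]}},
    }],
}
print(medication_statement)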
Article
Background: Large amounts of patient data are routinely manually collected in hospitals by using standalone medical devices, including vital signs. Such data is sometimes stored in spreadsheets, not forming part of patients' electronic health records, and is therefore difficult for caregivers to combine and analyze. One possible solution to overcome these limitations is the interconnection of medical devices via the Internet using a distributed platform, namely the Internet of Things. This approach allows data from different sources to be combined in order to better diagnose patient health status and identify possible anticipatory actions. Methods: This work introduces the concept of the Internet of Health Things (IoHT), focusing on surveying the different approaches that could be applied to gather and combine data on vital signs in hospitals. Common heuristic approaches are considered, such as weighted early warning scoring systems, and the possibility of employing intelligent algorithms is analyzed. Results: As a result, this article proposes possible directions for combining patient data in hospital wards to improve efficiency, allow the optimization of resources, and minimize patient health deterioration. Conclusion: It is concluded that a patient-centered approach is critical, and that the IoHT paradigm will continue to provide more optimal solutions for patient management in hospital wards.
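A weighted early-warning score of the kind mentioned above can be sketched as follows in Python; the vital-sign bands and weights are illustrative assumptions, not an official scoring chart.

# Illustrative vital-sign bands: (low, high, weight); values outside all bands score 3
def vital_score(value, bands):
    for low, high, weight in bands:
        if low <= value <= high:
            return weight
    return 3

heart_rate_bands  = [(51, 90, 0), (41, 50, 1), (91, 110, 1), (111, 130, 2)]
resp_rate_bands   = [(12, 20, 0), (9, 11, 1), (21, 24, 2)]
temperature_bands = [(36.1, 38.0, 0), (35.1, 36.0, 1), (38.1, 39.0, 1)]

def early_warning_score(heart_rate, resp_rate, temperature):
    # Sum the per-vital weights into a single ward alert score
    return (vital_score(heart_rate, heart_rate_bands)
            + vital_score(resp_rate, resp_rate_bands)
            + vital_score(temperature, temperature_bands))

print(early_warning_score(heart_rate=118, resp_rate=22, temperature=38.4))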
Article
Feature selection for predictive analytics continues to be a major challenge in the healthcare industry, particularly as it relates to readmission prediction. Several research works mining healthcare data have focused on structured data for readmission prediction. Even among works based on unstructured data, significant gaps remain in addressing class imbalance and context-specific noise removal, which necessitates new approaches to readmission prediction using unstructured data. In this work, a novel approach is proposed for feature selection and removal of domain-related stop words from class-imbalanced, unstructured discharge summary notes. The proposed predictive model uses these features along with other relevant structured data. Five iterations of prediction were performed to tune and improve the models, the results of which are presented and analyzed in this paper. The authors suggest future directions for implementing the proposed approach in hospitals or clinics, aimed at leveraging both structured data and unstructured discharge summary notes.
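The combination of domain stop-word removal and feature selection on discharge summaries might be sketched as below; the stop-word list, summaries and labels are toy assumptions, not the authors' corpus or final feature set.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Assumed domain-related stop words and toy discharge summaries
domain_stop_words = ["patient", "discharge", "discharged", "hospital", "admitted", "home"]

summaries = [
    "patient discharged home, heart failure stable on diuretics",
    "patient admitted with pneumonia, antibiotics completed, discharged home",
    "heart failure exacerbation, diuretics adjusted, close follow up advised",
    "pneumonia resolved, no readmission risk factors noted at discharge",
]
readmitted = [1, 0, 1, 0]  # toy labels: 1 = readmitted within 30 days

# Remove domain noise, then keep the terms most associated with readmission
vectorizer = CountVectorizer(stop_words=domain_stop_words)
X = vectorizer.fit_transform(summaries)
selector = SelectKBest(chi2, k=5).fit(X, readmitted)
selected = vectorizer.get_feature_names_out()[selector.get_support()]
print(list(selected))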
Chapter
The growing use of information technology (IT) in the present era has been associated with the generation of a huge amount of data. Throughout its history, the healthcare industry has generated a large amount of data on patient care, and the current trend is towards the digitalization of these data. Digital data and information in healthcare organizations are growing extensively. These data are gathered from a variety of sources and create new challenges, which lead to many changes in the health sciences. In the near future, the sheer availability of digital data will make it difficult to handle, and big data will exceed traditional scales and dimensions. Today, improving the performance of the healthcare industry depends on having more information and more organized knowledge. Big data allows us to do many things that could not have been done in the past, and progress in IT and in solutions for managing big data can lead to more effective outcomes in healthcare. This article begins by presenting the current and future status of big data in healthcare, then explains the features of big data in the area of health as well as the potential benefits of studying big data. Finally, it identifies and ranks the challenges of using big data by means of a multi-criteria decision-making technique. The aim of this study is to identify the most important challenges in the adoption of big data solutions for healthcare organizations.
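A simple weighted-sum ranking, one of the most basic multi-criteria decision-making techniques, can illustrate how challenges might be scored and ordered; the criteria, weights and scores below are invented for demonstration and do not reflect the chapter's results.

# Assumed criteria weights and challenge scores (0-10), for demonstration only
criteria_weights = {"cost": 0.40, "complexity": 0.35, "impact_on_care": 0.25}

challenges = {
    "data privacy and security": {"cost": 7, "complexity": 8, "impact_on_care": 9},
    "interoperability":          {"cost": 8, "complexity": 9, "impact_on_care": 7},
    "skilled staff shortage":    {"cost": 6, "complexity": 5, "impact_on_care": 8},
}

def weighted_score(scores):
    # Simple weighted-sum aggregation across the criteria
    return sum(criteria_weights[c] * v for c, v in scores.items())

for name in sorted(challenges, key=lambda n: weighted_score(challenges[n]), reverse=True):
    print(f"{name}: {weighted_score(challenges[name]):.2f}")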
Chapter
Nowadays, any health-related issue is a very sensitive matter in society, as it directly affects people's well-being. In this sense, in order to improve the quality of health services, good quality management of complaints is essential. Given the volume of complaints, there is a need to explore Data Science models in order to automate internal quality complaint processes. Thus, the main objective of this article is to improve the quality of the health-claims analysis process, as well as knowledge analysis at the level of information systems applied to health. Data treatment was developed in two stages: loading the data into an auxiliary database and processing it through an Extract, Transform and Load (ETL) process. With the data warehouse created, an Online Analytical Processing (OLAP) cube was developed and later connected to Power BI, enabling the creation and analysis of dashboards. The various models studied revealed the rather poor quality of the data that supports them. With the application of filters, it was possible to obtain a more detailed temporal picture, such as the time of year in which most complaints are registered. The study covers both paper and online complaints. For paper complaints, a total of 234 records in the selected period is dominated by the "Unknown" valence, with 72.67% of registrations. For online complaints, a total of 42 records in the selected period shows the following distribution: typification "Other subjects" with 19.05% of registrations; state "Inserted" with 90.48% of registrations; ignorance "Unknown" with 95.24% of registrations; and typology "Complaint" with 69.05% of registrations.
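The ETL-then-aggregate flow can be illustrated with a brief pandas sketch; the complaint records and dimensions are toy assumptions, and the aggregation stands in for the OLAP cube that feeds the dashboards.

import pandas as pd

# Toy complaint records standing in for the loaded source data
raw = pd.DataFrame({
    "date":     ["2019-01-10", "2019-01-22", "2019-07-03", "2019-07-15"],
    "channel":  ["paper", "online", "paper", "online"],
    "typology": ["Complaint", "Other subjects", "Complaint", "Complaint"],
})

# Transform: derive the time dimension used for temporal analysis
raw["month"] = pd.to_datetime(raw["date"]).dt.month

# Cube-style aggregation: complaint counts by month, channel and typology
cube = (raw.groupby(["month", "channel", "typology"]).size()
            .unstack(["channel", "typology"], fill_value=0))
print(cube)  # this aggregate would feed the Power BI dashboards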
Article
The paper presents the main research results in the area of data mining applied to medicine. We propose a new information technology for data mining of different classes of biomedical images, based on a methodology for selecting diagnostically relevant information and creating informative characteristics. Application of Big Data technology in the proposed medical diagnostic systems has allowed us to improve the quality of the learning set and reduce the classification error. Based on these results, the conclusion is made that using many heterogeneous sources of diagnostic information makes it possible to improve the overall quality of diagnostics.
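A toy sketch of the "informative characteristics plus classifier" idea on synthetic image arrays is shown below; the features, model and data are placeholders and do not reproduce the paper's method.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(200, 32, 32))  # toy grayscale "scans"
labels = rng.integers(0, 2, size=200)              # toy diagnostic labels

def informative_features(img):
    # Simple intensity statistics stand in for diagnostically relevant descriptors
    hist, _ = np.histogram(img, bins=16, range=(0, 256), density=True)
    return np.concatenate([hist, [img.mean(), img.std()]])

X = np.array([informative_features(img) for img in images])
print(cross_val_score(RandomForestClassifier(random_state=0), X, labels, cv=5).mean())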