Fig 2 - available from: BMC Medical Informatics and Decision Making
Examples of input data validation rules with loose and strict validation criteria 

Source publication
Article
Full-text available
Background Electronic health records (EHRs) contain detailed clinical data stored in proprietary formats with non-standard codes and structures. Participating in multi-site clinical research networks requires EHR data to be restructured and transformed into a common format and standard terminologies, and optimally linked to other data sources. The...

Context in source publication

Context 1
... In the data extraction step, required data elements from the source system are extracted to a temporary data storage from which they are transformed and loaded into the target database. The D-ETL approach uses comma-separated values (CSV) text files for data exchange due to their wide use and acceptability [45]. Extracted data then go through data validation processes, including input data checks for missing data in required fields and for orphan foreign key values (e.g., values that are present in a foreign key column but not in a primary key column). In addition, data transformation processes usually have specific assumptions about the value and structure of input data that require validation. Figure 2 shows a list of example validation ...
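To make the two input checks concrete, here is a minimal Python sketch of how missing required fields and orphan foreign keys could be flagged in CSV extracts. The file and column names (person.csv, visit_occurrence.csv, person_id, and so on) are illustrative assumptions, not D-ETL's actual schema; the paper's loose versus strict criteria would roughly map to whether a flagged row is only reported or rejected outright.

```python
import pandas as pd

# Illustrative CSV extracts; file and column names are assumptions, not D-ETL's schema.
persons = pd.read_csv("person.csv")
visits = pd.read_csv("visit_occurrence.csv")

# Check 1: missing data in required fields of the visit extract.
required_fields = ["visit_id", "person_id", "visit_start_date"]
for field in required_fields:
    n_missing = visits[field].isna().sum()
    if n_missing:
        print(f"{n_missing} rows are missing required field '{field}'")

# Check 2: orphan foreign keys -- person_id values present in the visit extract
# but absent from the primary key column of the person extract.
orphans = visits.loc[~visits["person_id"].isin(persons["person_id"]), "person_id"]
if not orphans.empty:
    print(f"{orphans.nunique()} orphan person_id values (e.g. {orphans.iloc[0]})")
```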

Similar publications

Article
Full-text available
Objective Health data standardized to a common data model (CDM) simplifies and facilitates research. This study examines the factors that make standardizing observational health data to the Observational Medical Outcomes Partnership (OMOP) CDM successful. Materials and methods Twenty-five data partners (DPs) from 11 countries received funding from...

Citations

... • Clarity's immunization tables were included in OMOP's procedures and medications tables automated and manual coding, a scalable solution that has been shown to improve harmonization.31 Additionally, involving stakeholders interested in the extracted data in the mapping process may improve harmonization and increase data completion and correctness.9 ...
Article
Full-text available
Background Data exploration in modern electronic health records (EHRs) is often aided by user-friendly graphical interfaces providing “self-service” tools for end users to extract data for quality improvement, patient safety, and research without prerequisite training in database querying. Other resources within the same institution, such as Honest Brokers, may extract data sourced from the same EHR but obtain different results leading to questions of data completeness and correctness. Objectives Our objectives were to (1) examine the differences in aggregate output generated by a “self-service” graphical interface data extraction tool and our institution's clinical data warehouse (CDW), sourced from the same database, and (2) examine the causative factors that may have contributed to these differences. Methods Aggregate demographic data of patients who received influenza vaccines at three static clinics and three drive-through clinics in similar locations between August 2020 and December 2020 was extracted separately from our institution's EHR data exploration tool and our CDW by our organization's Honest Brokers System. We reviewed the aggregate outputs, sliced by demographics and vaccination sites, to determine potential differences between the two outputs. We examined the underlying data model, identifying the source of each database. Results We observed discrepancies in patient volumes between the two sources, with variations in demographic information, such as age, race, ethnicity, and primary language. These variations could potentially influence research outcomes and interpretations. Conclusion This case study underscores the need for a thorough examination of data quality and the implementation of comprehensive user education to ensure accurate data extraction and interpretation. Enhancing data standardization and validation processes is crucial for supporting reliable research and informed decision-making, particularly if demographic data may be used to support targeted efforts for a specific population in research or quality improvement initiatives.
... They posit that such methods can efficiently mask or anonymize personally identifiable information (PII) in healthcare datasets, thus lowering instances of re-identification and unauthorized access. Ong et al. (2017) explained that dynamic-ETL (Extract, Transform, Load) procedures can be used to anonymize health data in real time to prevent the disclosure of protected data during the ETL processes. They also discuss compliance with data governance and quality requirements, which is especially important when handling anonymized data. ...
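As an illustration of masking identifiers during the transform step, the following sketch pseudonymizes direct identifiers with a salted one-way hash while copying a CSV extract. The field names and salt handling are assumptions made for the example and do not reproduce the method of the cited works.

```python
import csv
import hashlib

SALT = b"site-specific-secret"          # assumption: a salt kept outside the shared dataset
PII_FIELDS = {"patient_name", "mrn"}    # illustrative direct-identifier columns

def pseudonymize(value: str) -> str:
    """Replace a direct identifier with a salted one-way hash."""
    return hashlib.sha256(SALT + value.encode("utf-8")).hexdigest()[:16]

with open("extract.csv", newline="") as src, open("extract_deid.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        for field in PII_FIELDS & set(row):      # mask only the identifier columns present
            row[field] = pseudonymize(row[field])
        writer.writerow(row)
```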
Article
Full-text available
Big data analytics has emerged as an incredibly valuable tool in understanding population health trends and improving the effectiveness of healthcare delivery systems. By leveraging big data sources from various domains, including the patients' electronic health records, claims data, wearables, and social media accounts, healthcare organizations can obtain novel and rich insights regarding population health characteristics, predisposing factors, and disease progression. The arrival of big data has transformed healthcare organizations by providing a scientific population health management and health system enhancement tool. Using superior analytical tools, healthcare entrepreneurs and policymakers can discover relations, rates, and patterns that are concealed in large data sets. Such knowledge can be used for early detection of possible interventions, management of resources, and prevention measures, thus leading to better health and less spending on health issues. Moreover, big data analytics helps in early diagnostics and the development of management strategies for high-risk groups, which in turn improves the functioning and effectiveness of healthcare systems. It is also important to note that big data solutions in healthcare are not limited to population health management, but also include functional aspects of healthcare organizations. Additionally, using data about patient movements, resource consumption, and clinical activity, it is possible to determine inefficiencies and opportunities to improve processes in healthcare organizations. This approach of collecting and analyzing data helps in decision making thus reducing time and improving patient flow and experience. However, incorporating real-time data into the clinical decision support systems can improve diagnostic capabilities, treatments offered, and patient tracking resulting in improved quality of services delivered.
... Some reports may only be accessible in portable document format (PDF) from the clinical information system (CIS); others originate from secondary software in a variety of different formats.13 Data transformation processes to harmonize all the data formats from their source systems within one central database are not ubiquitously established.14 We therefore present an open-source, LLM-based pipeline which tackles these challenges in medical information extraction (IE). ...
Preprint
Full-text available
In clinical science and practice, text data, such as clinical letters or procedure reports, is stored in an unstructured way. This type of data is not a quantifiable resource for any kind of quantitative investigations and any manual review or structured information retrieval is time-consuming and costly. The capabilities of Large Language Models (LLMs) mark a paradigm shift in natural language processing and offer new possibilities for structured Information Extraction (IE) from medical free text. This protocol describes a workflow for LLM based information extraction (LLM-AIx), enabling extraction of predefined entities from unstructured text using privacy preserving LLMs. By converting unstructured clinical text into structured data, LLM-AIx addresses a critical barrier in clinical research and practice, where the efficient extraction of information is essential for improving clinical decision-making, enhancing patient outcomes, and facilitating large-scale data analysis. The protocol consists of four main processing steps: 1) Problem definition and data preparation, 2) data preprocessing, 3) LLM-based IE and 4) output evaluation. LLM-AIx allows integration on local hospital hardware without the need of transferring any patient data to external servers. As example tasks, we applied LLM-AIx for the anonymization of fictitious clinical letters from patients with pulmonary embolism. Additionally, we extracted symptoms and laterality of the pulmonary embolism of these fictitious letters. We demonstrate troubleshooting for potential problems within the pipeline with an IE on a real-world dataset, 100 pathology reports from the Cancer Genome Atlas Program (TCGA), for TNM stage extraction. LLM-AIx can be executed without any programming knowledge via an easy-to-use interface and in no more than a few minutes or hours, depending on the LLM model selected.
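A minimal sketch of the extraction and output-evaluation steps of such a protocol might look as follows, assuming a locally hosted model reachable through a placeholder query_local_llm function and an invented entity schema for symptoms and laterality; LLM-AIx's real interface and prompts are not shown in the abstract above.

```python
import json

# Placeholder for a locally hosted, privacy-preserving LLM; the actual LLM-AIx
# interface is not described here, so this function is purely hypothetical.
def query_local_llm(prompt: str) -> str:
    raise NotImplementedError("connect this to your local inference server")

# Illustrative entity schema for the pulmonary-embolism example.
SCHEMA = {"symptoms": "list of strings", "laterality": "left | right | bilateral | unknown"}

def extract_entities(report_text: str) -> dict:
    prompt = (
        "Extract the following fields from the clinical text and answer with JSON only.\n"
        f"Fields: {json.dumps(SCHEMA)}\n"
        f"Text: {report_text}"
    )
    raw = query_local_llm(prompt)
    try:
        return json.loads(raw)  # output evaluation starts with checking the JSON is parseable
    except json.JSONDecodeError:
        return {"error": "model did not return valid JSON", "raw": raw}
```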
... Configuration environments for Python [22], Anaconda [23], and R language [24] are used for the analysis of intermediate output data. When data are uploaded into the cloud-based platform, the data collecting service begins ETL into a distributed database system [25]. After their upload into cloud storage, data are processed by a cloud-computing engine, such as Hive or Spark, distributed using the Azkaban System, and maintained by the Apache Atlas metadata management system [26]. ...
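As a rough illustration of the ingest step described above, the following PySpark sketch reads an uploaded CSV extract and loads it into a Hive-managed table. The storage paths, table, and column names are assumptions, and orchestration (Azkaban) and metadata management (Atlas) are left out.

```python
from pyspark.sql import SparkSession

# Assumes a PySpark environment with Hive support; paths and names are illustrative.
spark = (SparkSession.builder
         .appName("anesthesia-record-ingest")
         .enableHiveSupport()
         .getOrCreate())

# Read an uploaded CSV extract from cloud storage ...
records = spark.read.csv("s3a://uploads/anesthesia_records/", header=True, inferSchema=True)

# ... drop obvious duplicates ("record_id" is an invented column) ...
records = records.dropDuplicates(["record_id"])

# ... and load the result into a Hive-managed table for downstream Spark/Hive queries.
records.write.mode("append").saveAsTable("warehouse.anesthesia_records")
```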
Article
Full-text available
Big data technologies have proliferated since the dawn of the cloud-computing era. Traditional data storage, extraction, transformation, and analysis technologies have thus become unsuitable for the large volume, diversity, high processing speed, and low value density of big data in medical strategies, which require the development of novel big data application technologies. In this regard, we investigated the most recent big data platform breakthroughs in anesthesiology and designed an anesthesia decision model based on a cloud system for storing and analyzing massive amounts of data from anesthetic records. The presented Anesthesia Decision Analysis Platform performs distributed computing on medical records via several programming tools, and provides services such as keyword search, data filtering, and basic statistics to reduce inaccurate and subjective judgments by decision-makers. Importantly, it can potentially improve anesthetic strategy and create individualized anesthesia decisions, lowering the likelihood of perioperative complications.
... However, the combination of both topics with the goal of a generic and easily adaptive mapper is missing in the literature, to our knowledge. Alongside declarative rules, the work of Ong et al [21] must be mentioned; they present a dynamic ETL approach that uses a custom mapping language to transform health care-related data into the OMOP (Observational Medical Outcomes Partnership) common data model. The formalized rules are rich in details but are proprietary and rather database-oriented due to their use case. ...
Article
Full-text available
Background Reaching meaningful interoperability between proprietary health care systems is a ubiquitous task in medical informatics, where communication servers are traditionally used for referring and transforming data from the source to target systems. The Mirth Connect Server, an open-source communication server, offers, in addition to the exchange functionality, functions for simultaneous manipulation of data. The standard Fast Healthcare Interoperability Resources (FHIR) has recently become increasingly prevalent in national health care systems. FHIR specifies its own standardized mechanisms for transforming data structures using StructureMaps and the FHIR mapping language (FML). Objective In this study, a generic approach is developed, which allows for the application of declarative mapping rules defined using FML in an exchangeable manner. A transformation engine is required to execute the mapping rules. Methods FHIR natively defines resources to support the conversion of instance data, such as an FHIR StructureMap. This resource encodes all information required to transform data from a source system to a target system. In our approach, this information is defined in an implementation-independent manner using FML. Once the mapping has been defined, executable Mirth channels are automatically generated from the resources containing the mapping in JavaScript format. These channels can then be deployed to the Mirth Connect Server. Results The resulting tool is called FML2Mirth, a Java-based transformer that derives Mirth channels from detailed declarative mapping rules based on the underlying StructureMaps. Implementation of the translate functionality is provided by the integration of a terminology server, and to achieve conformity with existing profiles, validation via the FHIR validator is built in. The system was evaluated for its practical use by transforming Labordatenträger version 2 (LDTv.2) laboratory results into Medical Information Object ( Medizinisches Informationsobjekt ) laboratory reports in accordance with the National Association of Statutory Health Insurance Physicians’ specifications and into the HL7 (Health Level Seven) Europe Laboratory Report. The system could generate complex structures, but LDTv.2 lacks some information to fully comply with the specification. Conclusions The tool for the auto-generation of Mirth channels was successfully presented. Our tests reveal the feasibility of using the complex structures of the mapping language in combination with a terminology server to transform instance data. Although the Mirth Server and the FHIR are well established in medical informatics, the combination offers space for more research, especially with regard to FML. Simultaneously, it can be stated that the mapping language still has implementation-related shortcomings that can be compensated by Mirth Connect as a base technology.
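The core idea of executing declarative mapping rules with a generic engine can be sketched as follows. This toy Python example is only loosely in the spirit of FML StructureMaps and the generated Mirth channels; the rule format and field names are invented for illustration.

```python
# A toy declarative mapping: each rule names a source field, a target element, and an
# optional transform. This is an analogy to FML StructureMaps, not their actual syntax.
RULES = [
    {"source": "PAT_NAME",  "target": "Patient.name",      "transform": None},
    {"source": "LAB_VALUE", "target": "Observation.value", "transform": float},
    {"source": "LAB_UNIT",  "target": "Observation.unit",  "transform": str.strip},
]

def apply_rules(source_record: dict, rules: list) -> dict:
    """Generic engine: walks the rule list instead of hard-coding the mapping."""
    target = {}
    for rule in rules:
        if rule["source"] in source_record:
            value = source_record[rule["source"]]
            if rule["transform"] is not None:
                value = rule["transform"](value)
            target[rule["target"]] = value
    return target

print(apply_rules({"PAT_NAME": "Doe, Jane", "LAB_VALUE": "5.4", "LAB_UNIT": " mmol/L "}, RULES))
```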
... In order to gain an overview of the potential application focuses of MDDs (Q1) and thus an indication of where the approaches have proven beneficial, the focused theme of application was first evaluated. According to the extracted data, the focuses of all included publications are classified into 7 different categories, namely medicine (n=9) [10,[32][33][34][35][36][37][38][39], data warehouse (n=13) [40][41][42][43][44][45][46][47][48][49][50][51][52], big data (n=4) [53][54][55][56], industry (n=4) [57][58][59][60], geoinformatics (n=1) [61], archaeology (n=1) [62], and military (n=1) [63]. This shows that data warehouse and medicine are the 2 categories that use the MDD approach the most. ...
... Another frequently used type of MDD approach was rule-based, which applied transformation rules generated based on the source and target to the ETL/ELT process. The rule-based approach was also widely used in the categories of data warehouse [40][41][42][43]49] and medicine [33,34,37,39]. All other MDD approaches besides the ontology-based and rule-based approaches were categorized as "other" (Table 1). ...
... This purpose can be divided into three detailed categories: (1) to automate the development of the ETL/ELT process [35,38,42,46,[48][49][50][51]60], (2) to develop a generic ETL/ELT process [39,47,52], and (3) to develop a new ETL/ELT process without any further technical specifications [40,45,46,55,57,61]. Additionally, the transformation part of the ETL/ELT process could also be automated by applying an MDD approach [34,37,41,44,58,63]. For example, Chen and Zhao [41] described an MDD approach for the automatic generation of SQL scripts for data transformation. ...
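A minimal sketch of such metadata-driven SQL generation could look like the following, with an invented mapping table and OMOP-like column names; the expressions and concept IDs are illustrative and are not taken from Chen and Zhao.

```python
# Invented mapping metadata: source column, target column, and an optional SQL expression.
MAPPING = [
    {"src": "pat_id",   "tgt": "person_id",         "expr": None},
    {"src": "birth_dt", "tgt": "year_of_birth",     "expr": "EXTRACT(YEAR FROM birth_dt)"},
    {"src": "sex_code", "tgt": "gender_concept_id",
     "expr": "CASE sex_code WHEN 'M' THEN 8507 WHEN 'F' THEN 8532 ELSE 0 END"},
]

def generate_insert_sql(source_table: str, target_table: str, mapping: list) -> str:
    """Emit an INSERT ... SELECT statement from the mapping metadata."""
    select_items = ", ".join(f"{m['expr'] or m['src']} AS {m['tgt']}" for m in mapping)
    target_cols = ", ".join(m["tgt"] for m in mapping)
    return (
        f"INSERT INTO {target_table} ({target_cols})\n"
        f"SELECT {select_items}\n"
        f"FROM {source_table};"
    )

print(generate_insert_sql("src_patient", "person", MAPPING))
```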
Article
Full-text available
Background Multisite clinical studies are increasingly using real-world data to gain real-world evidence. However, due to the heterogeneity of source data, it is difficult to analyze such data in a unified way across clinics. Therefore, the implementation of Extract-Transform-Load (ETL) or Extract-Load-Transform (ELT) processes for harmonizing local health data is necessary, in order to guarantee the data quality for research. However, the development of such processes is time-consuming and unsustainable. A promising way to ease this is the generalization of ETL/ELT processes. Objective In this work, we investigate existing possibilities for the development of generic ETL/ELT processes. Particularly, we focus on approaches with low development complexity by using descriptive metadata and structural metadata. Methods We conducted a literature review following the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines. We used 4 publication databases (ie, PubMed, IEEE Xplore, Web of Science, and BioMed Central) to search for relevant publications from 2012 to 2022. The PRISMA flow was then visualized using an R-based tool (Evidence Synthesis Hackathon). All relevant contents of the publications were extracted into a spreadsheet for further analysis and visualization. Results Regarding the PRISMA guidelines, we included 33 publications in this literature review. All included publications were categorized into 7 different focus groups (ie, medicine, data warehouse, big data, industry, geoinformatics, archaeology, and military). Based on the extracted data, ontology-based and rule-based approaches were the 2 most used approaches in different thematic categories. Different approaches and tools were chosen to achieve different purposes within the use cases. Conclusions Our literature review shows that using metadata-driven (MDD) approaches to develop an ETL/ELT process can serve different purposes in different thematic categories. The results show that it is promising to implement an ETL/ELT process by applying an MDD approach to automate the data transformation from Fast Healthcare Interoperability Resources to Observational Medical Outcomes Partnership Common Data Model. However, determining an appropriate MDD approach and tool to implement such an ETL/ELT process remains a challenge. This is due to the lack of comprehensive insight into the characterizations of the MDD approaches presented in this study. Therefore, our next step is to evaluate the MDD approaches presented in this study and to determine the most appropriate MDD approaches and the way to integrate them into the ETL/ELT process. This could verify the ability of using MDD approaches to generalize the ETL process for harmonizing medical data.
... Table 7 is an approach model that extends the data integration approach model by considering big data issues. [38] D_ELT, geospatial big data; [25] DOD-ETL, near real-time ETL; [13] BigDimETL, MultiDimensional Structure (MDS); [18], [21] Parallel-ETL, Variety Data; [22] QETL, Multidimensional cube; [23] Dynamic-ETL, Semantic; [40] ELTA, Reduce time; NewTL, ETL software system; [42] SimpleETL, Database, table and foreign Key ...
... Once the new information has been brought into the structure and formats defined in the data warehouse, growth is realized by loading data into the warehouse at defined intervals (the ETL = extract, transform, load process). Ultimately, the shortcomings of both the stores that accumulate new, unordered records and the rigid data warehouses gave rise to a hybrid solution combining their advantages, the "data lakehouse" model, which employs an intermediate storage layer and thus preserves the integrity of the original data [8][9][10]. ...
Article
Full-text available
Fragmentation of health data and biomedical research data is a major obstacle for precision medicine based on data-driven decisions. The development of personalized medicine requires the efficient exploitation of health data resources that are extraordinary in size and complexity, but highly fragmented, as well as technologies that enable data sharing across institutions and even borders. Biobanks are both sample archives and data integration centers. The analysis of large biobank data warehouses in federated datasets promises to yield conclusions with higher statistical power. A prerequisite for data sharing is harmonization, i.e., the mapping of the unique clinical and molecular characteristics of samples into a unified data model and standard codes. These databases, which are aligned to a common schema, then make healthcare information available for privacy-preserving federated data sharing and learning. The re-evaluation of sensitive health data is inconceivable without the protection of privacy, the legal and conceptual framework for which is set out in the GDPR (General Data Protection Regulation) and the FAIR (findable, accessible, interoperable, reusable) principles. For biobanks in Europe, the BBMRI-ERIC (Biobanking and Biomolecular Research Infrastructure - European Research Infrastructure Consortium) research infrastructure develops common guidelines, which the Hungarian BBMRI Node joined in 2021. As the first step, a federation of biobanks can connect fragmented datasets, providing high-quality data sets motivated by multiple research goals. Extending the approach to real-world data could also allow for higher level evaluation of data generated in the real world of patient care, and thus take the evidence generated in clinical trials within a rigorous framework to a new level. In this publication, we present the potential of federated data sharing in the context of the Semmelweis University Biobanks joint project. Orv Hetil. 2023; 164(21): 811-819.
... Likewise, the Dynamic-ETL (D-ETL) project builds a platform for data extraction, transformation, and loading using common formats and standard terminologies (Ong et al., 2017). To this end, the authors designed a methodology that automates part of the process through scalable, reusable, and flexible code development, while preserving the manual aspects of the process that require knowledge of a complex coding syntax. ...
... On the other hand, the studies by Hong Sun et al. (Sun et al., 2015) and Anil Pacaci et al. (Pacaci et al., 2018) propose methodologies based on the Semantic Web, representing data with the Resource Description Framework (RDF) and expressing conversions through Notation3 (N3) rules. Finally, Dynamic-ETL builds a process composed of (Ong et al., 2017): (1) (Brat et al., 2020; Abbas et al., 2021; Instituto i+12, 2022; OHDSI, 2022a; TriNetX, 2022). These projects respond to secondary-use data models of different designs and purposes explained above, such as clinical repositories, case report forms, and aggregated datasets. ...
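To illustrate the RDF-based representation these Semantic Web approaches start from, the sketch below builds a few triples for a lab result with rdflib. The namespace and predicates are invented for the example, and the N3 transformation rules themselves, which would be executed by a reasoner, are not shown.

```python
from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/ehr/")   # invented namespace, not from the cited studies

g = Graph()
patient = EX["patient/123"]
observation = EX["observation/456"]

# Represent a lab result as triples; N3 rules (not shown) executed by a reasoner
# could then rewrite these triples into the target repository model.
g.add((patient, RDF.type, EX.Patient))
g.add((observation, RDF.type, EX.Observation))
g.add((observation, EX.subject, patient))
g.add((observation, EX.loincCode, Literal("2345-7")))
g.add((observation, EX.value, Literal(5.4)))

print(g.serialize(format="turtle"))
```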
... This explains why the earlier EHR reuse methodologies have focused on transforming the EHR into standardized repository models such as the OMOP CDM (Sun et al., 2015; Ong et al., 2017; Pacaci et al., 2018), rather than into models with complex constraints such as CRFs and aggregated datasets. Likewise, the set of data operations identified is only an initial specification that can be extended after applying the methodology to new use cases. ...
Thesis
Full-text available
The collection of health data for research and other secondary purposes must evolve from a paradigm based on the manual recording in specific information systems for a purpose of exploitation, towards an effective reuse of the data recorded during the care process in the Electronic Health Record, known as real-world data (RWD). However, to achieve this ideal scenario it is necessary that the data are recorded, extracted, and transformed from the information systems with full meaning, and through formal and transparent processes that make them understandable, auditable, and reproducible. This thesis aims to propose a methodology for the recording, management and reuse of health data based on dual architecture models, also known as Detailed Clinical Models (DCM). Thus, the contributions of the thesis are: (1) The study of the paradigm of Detailed Clinical Models in the EHR, specifically the UNE-EN ISO 13606 standard, for the management and governance of concept and data models in the different typologies of information systems that compose the EHR; (2) The analysis of standard terminologies, such as SNOMED CT and LOINC, to represent the meaning of the concepts of the health domain formalized through clinical archetypes; (3) The proposal of a formal, transparent and automated process of extraction, selection and transformation of health data for reuse in any purpose and application scenario; (4) The evaluation of the validity, utility and acceptability of the methodology in its application to different use cases. These contributions have been applied to different health data projects developed at the Hospital Universitario 12 de Octubre in the COVID-19 pandemic. These projects specified data models of different typologies: standardized repositories, case report forms and aggregated data sets. As these projects were developed in a critical situation such as the COVID-19 pandemic, the data were required in an agile and flexible manner and without additional effort for health professionals, thus providing an ideal scenario to apply and evaluate the proposed health data reuse methodology. The conclusion of this PhD Thesis is that the proposed methodology has made it possible to obtain valid and useful data for research projects, using a process that is accepted by the consumers of the data. This is a first step towards changing the paradigm of data collection for research, going from ad-hoc processes of manual collection for a single purpose, to a process that is efficient, as it takes advantage of what is already recorded in the Electronic Health Record; flexible, as it is applicable for multiple purposes and for any organization that demands the data; and transparent, as it can be analyzed in technical or functional auditing processes.
... The use of this common framework makes the operations, and their constraints, understandable by any organization wishing to incorporate them into its EHR reuse process, at whatever point in the process it deems necessary. [47][48][49][50][51] However, a future step of this work will be the formalization not only of the inputs of the operations but also of the outputs after their application. At this point in the development of the methodology, the loading of data into research databases is contemplated as a manual process once the data have been obtained in accordance with the requirements of the output model. ...
... An improvement to this process will be the implementation of a graphical interface tool that spares the user from filling in the XML file directly in a text editor, making it more accessible to nontechnical personnel, as previous works have done.47,48,51 Conclusions: This study has provided a novel solution to the difficulty of making the ETL processes for obtaining EHR-derived data for secondary use understandable, auditable, and reproducible. Thus, a transparent and flexible methodology was designed based on open standards and technologies, applicable to any clinical condition and health care organization, and even to EHR reuse processes already in place. ...
Article
Full-text available
Background: During the COVID-19 pandemic, several methodologies were designed for obtaining electronic health record (EHR)-derived datasets for research. These processes are often based on black boxes, in which clinical researchers are unaware of how the data were recorded, extracted, and transformed. In order to solve this, it is essential that extract, transform, and load (ETL) processes are based on transparent, homogeneous, and formal methodologies, making them understandable, reproducible, and auditable. Objectives: This study aims to design and implement a methodology, in accordance with the FAIR principles, for building ETL processes (focused on data extraction, selection, and transformation) for EHR reuse in a transparent and flexible manner, applicable to any clinical condition and health care organization. Methods: The proposed methodology comprises four stages: (1) analysis of secondary use models and identification of data operations, based on internationally used clinical repositories, case report forms, and aggregated datasets; (2) modeling and formalization of data operations, through the paradigm of the Detailed Clinical Models; (3) agnostic development of data operations, selecting SQL and R as programming languages; and (4) automation of the ETL instantiation, building a formal configuration file with XML. Results: First, four international projects were analyzed to identify 17 operations, necessary to obtain datasets according to the specifications of these projects from the EHR. With this, each of the data operations was formalized, using the ISO 13606 reference model, specifying the valid data types as arguments, inputs and outputs, and their cardinality. Then, an agnostic catalog of data operations was developed through the previously selected data-oriented programming languages. Finally, an automated ETL instantiation process was built from an ETL configuration file formally defined. Conclusions: This study has provided a transparent and flexible solution to the difficulty of making the processes for obtaining EHR-derived data for secondary use understandable, auditable, and reproducible. Moreover, the abstraction carried out in this study means that any previous EHR reuse methodology can incorporate these results.
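As an illustration of the fourth stage, the sketch below parses a hypothetical XML configuration file and instantiates operations from a small catalog. The element names, attributes, and the mmhg_to_kpa operation are invented for the example and do not reproduce the ISO 13606-based configuration schema of the study.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration; element and attribute names are invented and do not
# reproduce the configuration schema of the cited methodology.
CONFIG = """
<etl>
  <operation type="select"    archetype="BloodPressure" element="systolic"/>
  <operation type="transform" function="mmhg_to_kpa"    element="systolic"/>
</etl>
"""

def mmhg_to_kpa(values):
    return [v * 0.133322 for v in values]

CATALOG = {"mmhg_to_kpa": mmhg_to_kpa}   # agnostic catalog of data operations

def instantiate_pipeline(xml_text: str) -> list:
    """Read the configuration and return the ordered list of operations to execute."""
    pipeline = []
    for op in ET.fromstring(xml_text).findall("operation"):
        step = dict(op.attrib)
        if step["type"] == "transform":
            step["callable"] = CATALOG[step["function"]]
        pipeline.append(step)
    return pipeline

for step in instantiate_pipeline(CONFIG):
    print(step)
```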