Article · Literature Review

Abstract

Background and objective: In recent years, several data quality conceptual frameworks have been proposed across the Data Quality and Information Quality domains for assessing the quality of data. These frameworks are diverse, ranging from simple lists of concepts to complex ontological and taxonomical representations of data quality concepts. The goal of this study is to design, develop, and implement a platform-agnostic, computable data quality knowledge repository for data quality assessments. Methods: We identified computable data quality concepts by performing a comprehensive literature review of articles indexed in three major bibliographic data sources. From this corpus, we extracted data quality concepts, their definitions, applicable measures, and their computability, and identified conceptual relationships. We used these relationships to design and develop a data quality meta-model and implemented it in a quality knowledge repository. Results: We identified three primitives for programmatically performing data quality assessments: a data quality concept, its definition, and its measure or rule for data quality assessment, together with the associations among them. We modeled a computable data quality metadata repository and extended this framework to adapt, store, retrieve, and automate the assessment of other existing data quality assessment models. Conclusion: We identified research gaps in the data quality literature concerning the automation of data quality assessment methods. In this process, we designed, developed, and implemented a computable data quality knowledge repository for assessing quality and characterizing data in health data repositories. We leverage this knowledge repository in a service-oriented architecture to provide a scalable and reproducible framework for data quality assessments across disparate biomedical data sources.
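As a concrete illustration of the meta-model's primitives, the following minimal sketch encodes a concept, its definition, and a computable measure, linked by associations; the class names, fields, and the example rule are hypothetical and not the repository's actual schema.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical rendering of the three primitives described in the abstract:
# a data quality concept, its definition, and a computable measure/rule,
# linked together by explicit associations.

@dataclass
class DQMeasure:
    name: str
    rule: Callable[[list], float]  # computable rule applied to a column of values

@dataclass
class DQConcept:
    name: str
    definition: str
    measures: List[DQMeasure] = field(default_factory=list)  # associations

def non_null_fraction(values: list) -> float:
    """Share of non-missing values; one possible completeness rule."""
    return sum(v is not None for v in values) / len(values) if values else 0.0

completeness = DQConcept(
    name="Completeness",
    definition="The degree to which required data values are present.",
    measures=[DQMeasure("non_null_fraction", non_null_fraction)],
)

# Applying the concept's rules to a column from some data source
column = [5.1, None, 4.8, 6.0, None]
for m in completeness.measures:
    print(f"{completeness.name}/{m.name}: {m.rule(column):.2f}")  # 0.60
```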


... ity improvement purposes. 11,[16][17][18][19] These developments have neither explicitly nor systematically addressed the contextual and process categories that are vital to DQ assessment and management in RWD production and curation life cycles. The senior authors (S.-T.L., M.G.K., S.d.L.) therefore guided the integration of the HIDQF and data life cycle framework as the conceptual starting point for this literature review to identify practical and potential gaps in the assessment and management of DQ. ...
... This knowledge repository has been leveraged into a service-oriented architecture to perform as a scalable and reproducible framework for DQ assessments of disparate data sources. 18 Web-based tools such as the TAQIH (tabular DQ assessment and improvement for health) have been developed to conduct exploratory data analyses to improve completeness, accuracy, redundancy, and readability. 27 The ACHILLES (Automated Characterization of Health Information at Large-scale Longitudinal Evidence Systems) tool was developed and maintained by the OHDSI collaborative to conduct DQ checks on databases that are based on or mapped to the Observational Medical Outcomes Partnership Common Data Model (CDM). ...
... 2,29,30 The CDM-based ACHILLES 23 approach to DQ assessment is being enhanced through a service-oriented architecture-based Open Quality and Analytics Framework. 18 The Open Quality and Analytics Framework includes a DQ metamodel, a federated data integration platform to support semantically consistent metadata-centric querying of heterogeneous data sources, and a visualization metaframework to store visualizations for different DQ concepts, indicators, and measures. This supports the inclusion of a technical category in the HIDQF. ...
Article
Objective: Data quality (DQ) must be consistently defined in context. The attributes, metadata, and context of longitudinal real-world data (RWD) have not been formalized for quality improvement across the data production and curation life cycle. We sought to complete a literature review on DQ assessment frameworks, indicators and tools for research, public health, service, and quality improvement across the data life cycle. Materials and Methods: The review followed PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines. Databases from health, physical and social sciences were used: Cinahl, Embase, Scopus, ProQuest, Emcare, PsycINFO, Compendex, and Inspec. Embase was used instead of PubMed (an interface to search MEDLINE) because it includes all MeSH (Medical Subject Headings) terms used and journals in MEDLINE as well as additional unique journals and conference abstracts. A combined data life cycle and quality framework guided the search of published and gray literature for DQ frameworks, indicators, and tools. At least 2 authors independently identified articles for inclusion and extracted and categorized DQ concepts and constructs. All authors discussed findings iteratively until consensus was reached. Results: The 120 included articles yielded concepts related to contextual (data source, custodian, and user) and technical (interoperability) factors across the data life cycle. Contextual DQ subcategories included relevance, usability, accessibility, timeliness, and trust. Well-tested computable DQ indicators and assessment tools were also found. Conclusions: A DQ assessment framework that covers intrinsic, technical, and contextual categories across the data life cycle enables assessment and management of RWD repositories to ensure fitness for purpose. Balancing security, privacy, and FAIR principles requires trust and reciprocity, transparent governance, and organizational cultures that value good documentation.
... As they were selected specifically for the task of IRC measurement, this list of six dimensions was not identical to task-independent dimensions suggested by other studies, which required specific attributes from the data sources. For example, the framework of computable dimensions by Rajan et al. (2019) included the dimension Currency (also named Timeliness in some studies). This dimension required information about the average "out of date" values of data, which were not provided by the data sources under the survey. ...
... Another limitation of our study is that we operationalised the data dimensions by applying only computable metrics, compared to subjective ones (Rajan et al., 2019). Because not all metrics can be measured without human beings' judgement, the number of metrics to measure each DQD was limited. ...
Article
Full-text available
Measuring international research collaboration (IRC) is essential to various research assessment tasks but the effect of various measurement decisions, including which data sources to use, has not been thoroughly studied. To better understand the effect of data source choice on IRC measurement we design and implement a data quality assessment framework specifically for bibliographic data by reviewing and selecting available dimensions and designing appropriate computable metrics, and then validate the framework by applying it to four popular sources of bibliographic data: Microsoft Academic Graph, Web of Science, Dimensions, and the ACM Digital Library. Successful validation of the framework suggests it is consistent with the popular conceptual framework of information quality proposed by Wang and Strong (1996) and adequately identifies the differences in quality in the sources examined. Application of the framework reveals that Web of Science has the highest overall quality among the sets considered; and that the differences in quality can be explained primarily by how the data sources are organised. Our study comprises a methodological contribution that enables researchers to apply this IRC measurement tool in their studies; makes an empirical contribution by further characterising four popular sources of bibliographic data and their impact on IRC measurement. Peer Review https://publons.com/publon/10.1162/qss_a_00211
... Such a situation, long well-known, is becoming even more evident with pandemic data when decisions potentially impacting a large part of the population are based also on data having a partially known and controlled acquisition and transformation process. New concepts based on the idea of "fit for purpose" may be useful in this area as different types of studies may need different data quality characteristics, making the concept of absolute data quality less desirable [82,83]. ...
Article
Full-text available
In 2020, the CoViD-19 pandemic spread worldwide in an unexpected way and suddenly modified many life issues, including social habits, social relationships, teaching modalities, and more. Such changes were also observable in many different healthcare and medical contexts. Moreover, the CoViD-19 pandemic acted as a stress test for many research endeavors, and revealed some limitations, especially in contexts where research results had an immediate impact on the social and healthcare habits of millions of people. As a result, the research community is called to perform a deep analysis of the steps already taken, and to re-think steps for the near and far future to capitalize on the lessons learned due to the pandemic. In this direction, on June 09th–11th, 2022, a group of twelve healthcare informatics researchers met in Rochester, MN, USA. This meeting was initiated by the Institute for Healthcare Informatics—IHI, and hosted by the Mayo Clinic. The goal of the meeting was to discuss and propose a research agenda for biomedical and health informatics for the next decade, in light of the changes and the lessons learned from the CoViD-19 pandemic. This article reports the main topics discussed and the conclusions reached. The intended readers of this paper, besides the biomedical and health informatics research community, are all those stakeholders in academia, industry, and government, who could benefit from the new research findings in biomedical and health informatics research. Indeed, research directions and social and policy implications are the main focus of the research agenda we propose, according to three levels: the care of individuals, the healthcare system view, and the population view.
... An open-source, interoperable, and extensible data quality assessment tool (DQe-c) has been designed and developed for assessing and visualizing the integrity and consistency of electronic health record (EHR) data repositories [4]. A computable quality knowledge repository (QKR) is proposed to automate data quality assessment (DQA) in heterogeneous and disparate data environments, for assessing quality and characterizing data in healthcare data repositories [5]. A method is also proposed for standardizing the reporting and evaluation of data quality assessments of a nursing quality indicators database, by developing a data quality framework (DQF) and using the key dimensions of the data quality index (DQI) assessment framework. ...
Article
Full-text available
This article focuses on a preprocessing method for health data sources from heterogeneous medical equipment and on performance measurement for single-layer perceptron network intelligent computing. It structures a data quality evaluation system for heterogeneous medical equipment with five dimensions: patient personal information, basic medical data, medical testing data, medical treatment data, and medical device data. An innovative preprocessing algorithm for data sources is proposed to handle missing data, erroneous data, duplicated data, and data validity. By constructing a single-layer perceptron network, accuracy, misjudgment rate, precision, recall, true positive rate, and false positive rate in intelligent computing are studied, and the corresponding mathematical formulas are established. In the application study, the article collected 157 original data records from a medical institution; the algorithm was applied and the models were tested. The research addresses the problem of measuring the intelligent computing performance of heterogeneous devices with a single-layer perceptron network.
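The perceptron metrics named in the abstract follow the usual confusion-matrix definitions; a small sketch using those textbook formulas (not the article's own notation or data) is shown below.

```python
# Standard confusion-matrix metrics of the kind the article computes for its
# single-layer perceptron (accuracy, misjudgment rate, precision, recall,
# true/false positive rate). Written from the usual textbook definitions,
# not from the article's own formulas.

def confusion_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "misjudgment_rate": (fp + fn) / total,           # 1 - accuracy
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,     # = true positive rate
        "true_positive_rate": tp / (tp + fn) if tp + fn else 0.0,
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
    }

print(confusion_metrics(tp=40, fp=5, tn=100, fn=12))
```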
... This trust is intertwined with high data quality [9]. The EUROSTAT data quality criteria were evaluated against the criteria of relevance, accuracy, timeliness, accessibility, comparability, coherence, and completeness [10]. ...
Article
Data is a term encountered very frequently today. The correct use of data enables correct evaluation, which in turn ensures efficient use of resources and improves the quality of the services provided. The health sector is among the areas where the most data is collected. The material and immaterial burden of healthcare service delivery is heavy, and delivering this service in the best possible way is closely tied to the correct use of resources. Given the volume of health data, extracting meaningful results from it and providing information that can guide healthcare professionals such as physicians, nurses, and health managers is only possible with data mining methods. Because the health sector directly affects human life, the quality of the data used in healthcare is expected to be of the highest level. This study addresses data quality and data mining holistically. Through application examples, it provides a general perspective on the kinds of studies that can be carried out in the health sector with data mining.
... As existing quality assessment tools are primarily designed to assist with the development and evaluation of individual ontologies, understanding how to address quality issues when applied to multiple ontologies is vital for supporting these efforts. [360][361][362][363][364]. If this framework could successfully characterize ontology quality assessment it could potentially help standardize existing metrics and tools and provide a new avenue to support the harmonization of a wide range of translational resources. ...
Thesis
Full-text available
Traditional computational phenotypes (CPs) identify patient cohorts without consideration of underlying pathophysiological mechanisms. Deeper patient-level characterizations are necessary for personalized medicine and while advanced methods exist, their application in clinical settings remains largely unrealized. This thesis advances deep CPs through several experiments designed to address four requirements. Stability was examined through three experiments. First, a multiphase study was performed and identified resources and remediation plans as barriers preventing data quality (DQ) assessment. Then, through two experiments, the Harmonized DQ Framework was used to characterize DQ checks from six clinical organizations and 12 biomedical ontologies finding Atemporal Plausibility and Completeness and Value Conformance as the most common clinical checks and Value and Relation Conformance as the most common biomedical ontology checks. Scalability was examined through three experiments. First, a novel composite patient similarity algorithm was developed that demonstrated that information from clinical terminology hierarchies improved patient representations when applied to small populations. Then, ablation studies were performed and showed that the combination of data type, sampling window, and clinical domain used to characterize rare disease patients differed by disease. Finally, an algorithm that losslessly transforms complex knowledge graphs (KGs) into representations more suitable for inductive inference was developed and validated through the generation of expert-verified plausible novel drug candidates. Interoperability was examined through two experiments. First, 36 strategies to align five eMERGE CPs to standard clinical terminologies were examined and revealed lower false negative and positive counts in adults than in pediatric patient populations. Then, hospital-scale mappings between clinical terminologies and biomedical ontologies were developed and found to be accurate, generalizable, and logically consistent. Multimodality was examined through two experiments. A novel ecosystem for constructing ontologically-grounded KGs under alternative knowledge models using different relation strategies and abstraction strategies was created. The resulting KGs were validated through successfully enriching portions of the preeclampsia molecular signature with no previously known literature associations. These experiments were used to develop a joint learning framework for inferring molecular characterizations of patients from clinical data. The utility of this framework was demonstrated through the accurate inference of EHR-derived rare disease patient genotypes/phenotypes from publicly available molecular data.
... It should be noted that, although more recent DQCs exist (e.g. Gürdür et al., 2019; Huang, 2018; Liu et al., 2020; Rajan et al., 2019; Rasool and Warraich, 2018; Teh et al., 2020), the ones shown are still frequently cited in current DQ research. In the tables, the final two columns show the number of Scopus (Sc) citations and the number of Google Scholar (GS) citations (citation counts were conducted on August 14, 2020). ...
Article
Purpose: Numerous data quality (DQ) definitions in the form of sets of DQ dimensions are found in the literature. The great differences across such DQ classifications (DQCs) imply a lack of clarity about what DQ is. For an improved foundation for future research, this paper aims to clarify the ways in which DQCs differ and provide guidelines for dealing with this variance. Design/methodology/approach: A literature review identifies DQCs in conference and journal articles, which are analyzed to reveal the types of differences across these. On this basis, guidelines for future research are developed. Findings: The literature review found 110 unique DQCs in journal and conference articles. The analysis of these articles identified seven distinct types of differences across DQCs. This gave rise to the development of seven guidelines for future DQ research. Research limitations/implications: By identifying differences across DQCs and providing a set of guidelines, this paper may promote that future research, to a greater extent, will converge around common understandings of DQ. Practical implications: Awareness of the identified types of differences across DQCs may support managers when planning and conducting DQ improvement projects. Originality/value: The literature review did not identify articles which, based on systematic searches, identify and analyze existing DQCs. Thus, this paper provides new knowledge on the variance across DQCs, as well as guidelines for addressing this.
... Data quality dimension is intended to categorize types of data quality measurement. It consists of accuracy, completeness, concordance, consistency, currency, redundancy, or any other dimension [28,34]. For instance, the World Health Organization identifies the completeness and timeliness of the data, the internal coherence of data, the external coherence of data, and data comparisons on the entire population as data quality indicators [24]. ...
Chapter
Since the right decision is made from correct data, assessing data quality is an important process in computational science when working in a data-driven environment. Appropriate data quality ensures the validity of decisions made by any decision-maker. A very promising area for overcoming common data quality issues is computational intelligence. This paper examines intelligence techniques used for assessing data quality from past to present, reflecting the trend over the last two decades. Results of a bibliometric analysis are derived and summarized based on the clustered themes embedded in the data quality field. In addition, a network visualization map and strategic diagrams based on keyword co-occurrence are presented. These reports demonstrate that computational intelligence techniques, such as machine and deep learning, fuzzy set theory, and evolutionary computing, are essential for uncovering and solving data quality issues.
... Therefore, a standardized reporting scheme describing data quality would be essential to assess whether research questions could be appropriately answered with a data collection at hand. 29 The interpretation of the new world registries would be challenging as well. ...
Article
Full-text available
Background Patient registries are an established methodology in health services research. For more than 150 years, registries have collected information concerning groups of similar patients to answer research questions. Elaborate recommendations about the appropriate development and efficient operation of registries are available. However, the scene changes rapidly. Objectives The aim of the study is to describe current trends in registry research for health services research. Methods Registries developed within a German funding scheme for model registries in health services research were analyzed. The observations were compared with recent recommendations of the Agency for Healthcare Research and Quality (AHRQ) on registries in the 21st century. Results Analyzing six registries from the funding scheme revealed the following trends: recruiting healthy individuals, representing familial or other interpersonal relationships, recording patient-reported experiences or outcomes, accepting participants as study sites, actively informing participants, integrating the registry with other data collections, and transferring data from the registry to electronic patient records. This list partly complies with the issues discussed by the AHRQ. The AHRQ structured its ideas in five chapters: increasing focus on the patient, engaging patients as partners, digital health and patient registries, direct-to-patient registry, and registry networks. Conclusion For the near future, it can be said that the concept and design of a registry should place the patient at the center. Registries will be increasingly linked together and interconnected with other data collections. New challenges arise regarding the management of data quality and the interpretation of results from less controlled settings. Here, further research related to the methodology of registries is needed.
... The research gap in article [39] is its focus on data quality assessment based on metrics, which are not sufficient for contextual data quality; our proposed model, in contrast, supports a deep learning and semantic ontology based data quality assessment methodology. To analyze the data quality of healthcare data, the authors of [40] developed data quality knowledge repositories that contain the characteristics of data quality rules. These rules have been used to measure the quality of healthcare data. ...
Article
Full-text available
Due to the convergence of advanced technologies such as the Internet of Things, Artificial Intelligence, and Big Data, a healthcare platform accumulates data in huge quantities from several heterogeneous sources. The adequate usage of this data may increase the impact of and improve the healthcare service quality; however, the quality of the data may be questionable. Assessing the quality of the data for the task at hand may reduce the associated risks and increase confidence in the data's usability. To overcome the aforementioned challenges, this paper presents a web objects based contextual data quality assessment model with enhanced classification metric parameters. A semantic ontology of virtual objects, composite virtual objects, and services is also proposed for the parameterization of contextual data quality assessment of web objects data. The novelty of this article is the provision of contextual data quality assessment mechanisms at the data acquisition, assessment, and service levels for web objects enabled semantic data applications. To evaluate the proposed data quality assessment mechanism, web objects enabled affective stress and teens' mood care semantic data applications are designed, and a deep data quality learning model is developed. The findings of the proposed approach reveal that, once a data quality assessment model is trained on web objects enabled healthcare semantic data, it can be used to classify incoming data quality along various contextual data quality metric parameters. Moreover, the data quality assessment mechanism presented in this paper can be applied to other application domains by incorporating a data quality analysis requirements ontology.
Article
Full-text available
This paper presents a comprehensive exploration of data quality terminology, revealing a significant lack of standardisation in the field. The goal of this work was to conduct a comparative analysis of data quality terminology across different domains and structure it into a hierarchical data model. We propose a novel approach for aggregating disparate data quality terms used to describe the multiple facets of data quality under common umbrella terms with a focus on the ISO 25012 standard. We introduce four additional data quality dimensions: governance, usefulness, quantity, and semantics. These dimensions enhance specificity, complementing the framework established by the ISO 25012 standard, as well as contribute to a broad understanding of data quality aspects. The ISO 25012 standard, a general standard for managing the data quality in information systems, offers a foundation for the development of our proposed Data Quality Data Model. This is due to the prevalent nature of digital systems across a multitude of domains. In contrast, frameworks such as ALCOA+, which were originally developed for specific regulated industries, can be applied more broadly but may not always be generalisable. Ultimately, the model we propose aggregates and classifies data quality terminology, facilitating seamless communication of the data quality between different domains when collaboration is required to tackle cross-domain projects or challenges. By establishing this hierarchical model, we aim to improve understanding and implementation of data quality practices, thereby addressing critical issues in various domains.
Article
Full-text available
Data quality issues can significantly hinder research reproducibility, data sharing, and reuse. At the forefront of addressing data quality issues are research data repositories (RDRs). This study conducted a systematic analysis of data quality assurance (DQA) practices in RDRs, guided by activity theory and data quality literature, resulting in conceptualizing a data quality assurance model (DQAM) for RDRs. DQAM outlines a DQA process comprising evaluation, intervention, and communication activities and categorizes 17 quality dimensions into intrinsic and product‐level data quality. It also details specific improvement actions for data products and identifies the essential roles, skills, standards, and tools for DQA in RDRs. By comparing DQAM with existing DQA models, the study highlights its potential to improve these models by adding a specific DQA activity structure. The theoretical implication of the study is a systematic conceptualization of DQA work in RDRs that is grounded in a comprehensive analysis of the literature and offers a refined conceptualization of DQA integration into broader frameworks of RDR evaluation. In practice, DQAM can inform the design and development of DQA workflows and tools. As a future research direction, the study suggests applying and evaluating DQAM across various domains to validate and refine this model further.
Article
Background The increasing prevalence of electronic health records (EHRs) in healthcare systems globally has underscored the importance of data quality for clinical decision-making and research, particularly in obstetrics. High-quality data is vital for an accurate representation of patient populations and to avoid erroneous healthcare decisions. However, existing studies have highlighted significant challenges in EHR data quality, necessitating innovative tools and methodologies for effective data quality assessment and improvement. Objective This article addresses the critical need for data quality evaluation in obstetrics by developing a novel tool. The tool utilizes Health Level 7 (HL7) Fast Healthcare Interoperable Resources (FHIR) standards in conjunction with Bayesian Networks and expert rules, offering a novel approach to assessing data quality in real-world obstetrics data. Methods A harmonized framework focusing on completeness, plausibility, and conformance underpins our methodology. We employed Bayesian networks for advanced probabilistic modeling, integrated outlier detection methods, and a rule-based system grounded in domain-specific knowledge. The development and validation of the tool were based on obstetrics data from 9 Portuguese hospitals, spanning the years 2019-2020. Results The developed tool demonstrated strong potential for identifying data quality issues in obstetrics EHRs. Bayesian networks used in the tool showed high performance for various features, with area under the receiver operating characteristic curve (AUROC) between 75% and 97%. The tool's infrastructure and interoperable format as a FHIR Application Programming Interface (API) enable possible deployment of real-time data quality assessment in obstetrics settings. Our initial assessments show promise: even when compared with physicians' assessments of real records, the tool can reach an AUROC of 88%, depending on the threshold defined. Discussion Our results also show that obstetrics clinical records are difficult to assess in terms of quality, and that assessments like ours could benefit from more categorical approaches to ranking records between bad and good quality. Conclusion This study contributes significantly to the field of EHR data quality assessment, with a specific focus on obstetrics. The combination of HL7-FHIR interoperability, machine learning techniques, and expert knowledge presents a robust, adaptable solution to the challenges of healthcare data quality. Future research should explore tailored data quality evaluations for different healthcare contexts, as well as further validation of the tool's capabilities, enhancing its utility across diverse medical domains.
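To make the rule-based part of such a tool concrete, the sketch below applies hypothetical completeness, plausibility, and conformance rules to a single invented obstetrics record; it omits the Bayesian-network and FHIR layers and does not reproduce the tool's actual rule base.

```python
# Minimal sketch of the expert-rule layer only (the article also uses Bayesian
# networks and FHIR resources, which are omitted here). Field names, thresholds,
# and rules are hypothetical examples, not the tool's actual rule base.

def check_obstetrics_record(record: dict) -> dict:
    issues = {"completeness": [], "plausibility": [], "conformance": []}

    # Completeness: required fields must be present
    for field_name in ("gestational_age_weeks", "birth_weight_g", "delivery_mode"):
        if record.get(field_name) is None:
            issues["completeness"].append(field_name)

    # Plausibility: values must fall in clinically sensible ranges
    ga = record.get("gestational_age_weeks")
    if ga is not None and not (20 <= ga <= 44):
        issues["plausibility"].append("gestational_age_weeks")
    bw = record.get("birth_weight_g")
    if bw is not None and not (300 <= bw <= 6500):
        issues["plausibility"].append("birth_weight_g")

    # Conformance: coded values must come from the expected value set
    if record.get("delivery_mode") not in (None, "vaginal", "cesarean"):
        issues["conformance"].append("delivery_mode")

    return issues

print(check_obstetrics_record(
    {"gestational_age_weeks": 52, "birth_weight_g": 3200, "delivery_mode": "unknown"}
))
```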
Article
Full-text available
Background Health care has not reached the full potential of the secondary use of health data because of, among other issues, concerns about the quality of the data being used. The shift toward digital health has led to an increase in the volume of health data. However, this increase in quantity has not been matched by a proportional improvement in the quality of health data. Objective This review aims to offer a comprehensive overview of the existing frameworks for data quality dimensions and assessment methods for the secondary use of health data. In addition, it aims to consolidate the results into a unified framework. Methods A review of reviews was conducted including reviews describing frameworks of data quality dimensions and their assessment methods, specifically from a secondary use perspective. Reviews were excluded if they were not related to the health care ecosystem, lacked relevant information related to our research objective, or were published in languages other than English. Results A total of 22 reviews were included, comprising 22 frameworks, with 23 different terms for dimensions, and 62 definitions of dimensions. All dimensions were mapped to the data quality framework of the European Institute for Innovation through Health Data. In total, 8 reviews mentioned 38 different assessment methods, pertaining to 31 definitions of the dimensions. Conclusions The findings in this review revealed a lack of consensus in the literature regarding the terminology, definitions, and assessment methods for data quality dimensions. This creates ambiguity and difficulties in developing specific assessment methods. This study goes a step further by assigning all observed definitions to a consolidated framework of 9 data quality dimensions.
Article
Child welfare decisions have life-impacting consequences which are often underpinned by limited or inadequate data of poor quality. Though research on data quality has gained popularity and made advances in various practical areas, it has not made significant inroads into child welfare fields or data systems. Poor data quality can hinder service decision-making, impacting child behavioral health and well-being as well as increasing unnecessary expenditure of time and resources. Poor data quality can also undermine the validity of research and slow policymaking processes. The purpose of this commentary is to summarize the data quality research base in other fields, describe the obstacles and unique challenges of improving data quality in child welfare, and propose necessary steps for scientific research and practical implementation that enable researchers and practitioners to improve the quality of child welfare services based on enhanced data quality.
Chapter
Clinical information, stored over time and increasingly linked to other types of information such as environmental and social determinants of health and healthcare claims, is a potentially rich data source for clinical research. Knowledge discovery in databases (KDD) is a process for pattern discovery and predictive modeling in large databases. KDD encompasses and makes extensive use of data-mining methods—automated processes and algorithms that enable pattern recognition and classification. Characteristically, KDD involves the use of machine learning methods developed in the domain of artificial intelligence and information retrieval. These methods, which include both structure learning and parameter learning, have been applied to healthcare and biomedical data for various purposes with good success and potential or realized clinical translation. We introduce the Fayyad model of knowledge discovery in databases and describe the steps of the process, providing select examples from clinical research informatics. These steps range from initial data selection and preparation to interpretation and evaluation. Commonly used data-mining methods are surveyed: artificial neural networks, decision-tree induction, support vector machines (kernel methods), association-rule induction, k-nearest neighbor, and probabilistic methods such as Bayesian networks. We link methods for evaluating the models that result from the KDD process to methods used in diagnostic medicine, spotlighting measures derived from a confusion matrix and receiver operating characteristic curve analysis and, more recently, uncertainty quantification and conformal prediction. Throughout the chapter, we discuss salient aspects of biomedical data management and use, including applications, the use of FAIR principles, pipelines and infrastructure for KDD, and future directions. Keywords: Knowledge discovery in databases; Data mining; Artificial neural networks; Support vector machines; Decision trees; k-Nearest neighbor classification; Clinical data repositories; Machine learning
Article
Full-text available
Abstract: Research data management is recognized by the scientific community as an important part of good research practice. Accordingly, it is believed that such data should always be available for access and reuse. In this context, data curation and data quality are understood as strategic elements. This work aims to characterize and specify the existing scientific production on the topic "data quality in research data management" through bibliometric indicators. Methodologically, this research is quantitative and qualitative in nature, exploratory with respect to its objectives, and uses the Web of Science and Scopus databases to compose the corpus of the bibliometric study. As a result, from a corpus of 77 articles, a period of relevant publications between 1984 and 2020 was identified, with 2019 being the year with the most published works. Additionally, 7 publication venues presented more than one article on the researched topic, and the United States was the country with the most published works, totaling 34. Computer Science was the field with the largest output on this topic, showing a trend toward interdisciplinarity with the biological, applied social, and health sciences. Finally, it is concluded that, based on the awareness that
Conference Paper
Full-text available
IoT, along with social media, is one of the biggest reasons why real-time data analysis is crucial, since IoT data are huge in size. Because the appropriate type of analysis depends on the type of data, multiple analytics techniques need to be applied to achieve results. We created a single codebase with three analytics models: a binary classification model, an outlier detection model, and a time series model, which compute average accuracy, detect outliers, and compute mean absolute error as well as predicted deviation over time, respectively. The average accuracy was observed to be 84.6% in the binary classification model, and the mean absolute error was observed to be 7.79% in the time series model.
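For orientation, the sketch below shows the simplest textbook versions of two of the three models mentioned (z-score outlier detection and mean absolute error for a forecast); it is illustrative only and not the authors' implementation.

```python
# Two of the three models the paper mentions, in their simplest textbook forms:
# a z-score outlier detector and mean absolute error for a time-series forecast.
# Illustrative sketch with invented data, not the authors' code.
from statistics import mean, pstdev

def zscore_outliers(values, threshold=3.0):
    mu, sigma = mean(values), pstdev(values)
    return [v for v in values if sigma and abs(v - mu) / sigma > threshold]

def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

readings = [21.0, 21.3, 20.9, 21.1, 35.0, 21.2]        # hypothetical IoT stream
print(zscore_outliers(readings, threshold=2.0))         # [35.0]
print(mean_absolute_error([10, 12, 14], [11, 12, 13]))  # ~0.67
```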
Article
Genomic data are growing at unprecedented pace, along with new protocols, update polices, formats and guidelines, terminologies and ontologies, which are made available every day by data providers. In this continuously evolving universe, enforcing quality on data and metadata is increasingly critical. While many aspects of data quality are addressed at each individual source, we focus on the need for a systematic approach when data from several sources are integrated, as such integration is an essential aspect for modern genomic data analysis. Data quality must be assessed from many perspectives, including accessibility, currency, representational consistency, specificity, and reliability. In this article we review relevant literature and, based on the analysis of many datasets and platforms, we report on methods used for guaranteeing data quality while integrating heterogeneous data sources. We explore several real-world cases that are exemplary of more general underlying data quality problems and we illustrate how they can be resolved with a structured method, sensibly applicable also to other biomedical domains. The overviewed methods are implemented in a large framework for the integration of processed genomic data, which is made available to the research community for supporting tertiary data analysis over Next Generation Sequencing datasets, continuously loaded from many open data sources, bringing considerable added value to biological knowledge discovery.
Article
Full-text available
A knowledge repository is commonly used to store and manage explicit knowledge from knowledge workers. The development of the Internet of Things (IoT) in Industry 4.0 enables improvement of the quality and quantity of knowledge repositories. This study focuses on developing an IoT-based knowledge repository as a means to support knowledge integration within The Marine Information System Study Program. The study provides the architectural design of a knowledge repository that consists of 5 knowledge domains of marine information systems and the design of an IoT-based knowledge repository, both of which are based on reviews of relevant literature and a focus group discussion. The solution offered by the current study allows knowledge integration through a Community of Practice as both the provider and the user of the knowledge.
Conference Paper
Full-text available
The field of Big Data and Big Data Analytics is one of the most rapidly emerging fields in today's technology. Data being the most important aspect of Big Data, it is important to select it in such a way that it provides maximum efficiency when analyzed. Data quality is a concept taken into account when measuring the quality of a dataset. Data quality imposes certain defined characteristics on data which need to be fulfilled to a certain extent for the data to be considered reliable for analysis. In this paper, we propose an approach that introduces a 'believability factor' to measure the reliability, or believability, of a dataset by taking a sample of the unstructured dataset. Along with the believability factor, the paper proposes a methodology to calculate the execution time on the sampled dataset and the mean absolute error between the believability of the sampled and the full unstructured dataset.
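The abstract does not give the believability formula, so the sketch below simply treats believability as the share of records passing basic validity rules, computes it on a random sample and on the full dataset, and reports the timing and the absolute error between the two; all rules and data are invented assumptions.

```python
# Illustrative sketch only: "believability" here is assumed to be the share of
# records passing basic validity rules, which is not necessarily the paper's
# definition. Sampling, timing, and error comparison mirror the abstract's idea.
import random, time

def believability(records):
    def valid(r):
        return r.get("age") is not None and 0 <= r["age"] <= 120
    return sum(valid(r) for r in records) / len(records)

records = [{"age": random.choice([25, 40, None, 300, 60])} for _ in range(10_000)]

start = time.perf_counter()
sample = random.sample(records, 500)
b_sample = believability(sample)
elapsed = time.perf_counter() - start

b_full = believability(records)
print(f"sample believability={b_sample:.3f} (computed in {elapsed*1e3:.1f} ms)")
print(f"absolute error vs full dataset={abs(b_sample - b_full):.3f}")
```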
Article
Full-text available
Introduction Electronic health record (EHR) data are known to have significant data quality issues, yet the practice and frequency of assessing EHR data is unknown. We sought to understand current practices and attitudes towards reporting data quality assessment (DQA) results by data professionals. Methods The project was conducted in four Phases: (1) examined current DQA practices among informatics/CER stakeholders via engagement meeting (07/2014); (2) characterized organizations conducting DQA by interviewing key personnel and data management professionals (07-08/2014); (3) developed and administered an anonymous survey to data professionals (03-06/2015); and (4) validated survey results during a follow-up informatics/CER stakeholder engagement meeting (06/2016). Results The first engagement meeting identified the theme of unintended consequences as a primary barrier to DQA. Interviewees were predominantly medical groups serving distributed networks with formalized DQAs. Consistent with the interviews, most survey (N=111) respondents utilized DQA processes/programs. A lack of resources and clear definitions of how to judge the quality of a dataset were the most commonly cited individual barriers. Vague quality action plans/expectations and data owners not trained in problem identification and problem-solving skills were the most commonly cited organizational barriers. Solutions included allocating resources for DQA, establishing standards and guidelines, and changing organizational culture. Discussion Several barriers affecting DQA and reporting were identified. Community alignment towards systematic DQA and reporting is needed to overcome these barriers. Conclusion Understanding barriers and solutions to DQA reporting is vital for establishing trust in the secondary use of EHR data for quality improvement and the pursuit of personalized medicine.
Article
Full-text available
Objective: Harmonized data quality (DQ) assessment terms, methods, and reporting practices can establish a common understanding of the strengths and limitations of electronic health record (EHR) data for operational analytics, quality improvement, and research. Existing published DQ terms were harmonized to a comprehensive unified terminology with definitions and examples and organized into a conceptual framework to support a common approach to defining whether EHR data is ‘fit’ for specific uses. Materials and Methods: DQ publications, informatics and analytics experts, managers of established DQ programs, and operational manuals from several mature EHR-based research networks were reviewed to identify potential DQ terms and categories. Two face-to-face stakeholder meetings were used to vet an initial set of DQ terms and definitions that were grouped into an overall conceptual framework. Feedback received from data producers and users was used to construct a draft set of harmonized DQ terms and categories. Multiple rounds of iterative refinement resulted in a set of terms and organizing framework consisting of DQ categories, subcategories, terms, definitions, and examples. The harmonized terminology and logical framework’s inclusiveness was evaluated against ten published DQ terminologies. Results: Existing DQ terms were harmonized and organized into a framework by defining three DQ categories: (1) Conformance (2) Completeness and (3) Plausibility and two DQ assessment contexts: (1) Verification and (2) Validation. Conformance and Plausibility categories were further divided into subcategories. Each category and subcategory was defined with respect to whether the data may be verified with organizational data, or validated against an accepted gold standard, depending on proposed context and uses. The coverage of the harmonized DQ terminology was validated by successfully aligning to multiple published DQ terminologies. Discussion: Existing DQ concepts, community input, and expert review informed the development of a distinct set of terms, organized into categories and subcategories. The resulting DQ terms successfully encompassed a wide range of disparate DQ terminologies. Operational definitions were developed to provide guidance for implementing DQ assessment procedures. The resulting structure is an inclusive DQ framework for standardizing DQ assessment and reporting. While our analysis focused on the DQ issues often found in EHR data, the new terminology may be applicable to a wide range of electronic health data such as administrative, research, and patient-reported data. Conclusion: A consistent, common DQ terminology, organized into a logical framework, is an initial step in enabling data owners and users, patients, and policy makers to evaluate and communicate data quality findings in a well-defined manner with a shared vocabulary. Future work will leverage the framework and terminology to develop reusable data quality assessment and reporting methods.
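A minimal sketch of how individual checks might be tagged with the harmonized categories (conformance, completeness, plausibility) and assessment contexts (verification, validation) is shown below; the check names and registry structure are invented for illustration.

```python
# Toy registry that tags checks with the harmonized DQ categories and
# assessment contexts described above. Check names and structure are invented.

CHECKS = [
    {"name": "icd10_code_format",          "category": "conformance",  "context": "verification"},
    {"name": "heart_rate_not_null",        "category": "completeness", "context": "verification"},
    {"name": "heart_rate_in_range",        "category": "plausibility", "context": "verification"},
    {"name": "diagnosis_matches_registry", "category": "plausibility", "context": "validation"},
]

def checks_for(category, context=None):
    """Return the names of checks in a category, optionally filtered by context."""
    return [c["name"] for c in CHECKS
            if c["category"] == category and (context is None or c["context"] == context)]

print(checks_for("plausibility"))                 # both plausibility checks
print(checks_for("plausibility", "validation"))   # only the gold-standard comparison
```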
Article
Full-text available
Objectives: We examine the following: (1) the appropriateness of using a data quality (DQ) framework developed for relational databases as a data-cleaning tool for a data set extracted from two EPIC databases, and (2) the differences in statistical parameter estimates on a data set cleaned with the DQ framework and a data set not cleaned with the DQ framework. Background: The use of data contained within electronic health records (EHRs) has the potential to open doors for a new wave of innovative research. Without adequate preparation of such large data sets for analysis, the results might be erroneous, which might affect clinical decision-making or the results of Comparative Effectiveness Research studies. Methods: Two emergency department (ED) data sets extracted from EPIC databases (adult ED and children's ED) were used as examples for examining the five concepts of DQ based on a DQ assessment framework designed for EHR databases. The first data set contained 70,061 visits; the second data set contained 2,815,550 visits. SPSS Syntax examples, as well as step-by-step instructions on how to apply the five key DQ concepts to these EHR database extracts, are provided. Conclusions: SPSS Syntax to address each of the DQ concepts proposed by Kahn et al. (2012)1 was developed. The data set cleaned using Kahn's framework yielded more accurate results than the data set cleaned without this framework. Future plans involve creating functions in the R language for cleaning data extracted from the EHR as well as an R package that combines DQ checks with missing data analysis functions.
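The study distributes SPSS Syntax; the following rough pandas analogue illustrates the same kinds of checks (missingness, implausible values, duplicates) on an invented ED extract and is not the authors' code.

```python
# Rough pandas analogue of the kinds of checks the study implements in SPSS
# Syntax (completeness, value ranges, duplicates) on a hypothetical ED extract.
# Column names and values are invented for illustration.
import pandas as pd

ed = pd.DataFrame({
    "visit_id":  [1, 2, 2, 3],
    "age":       [34, None, None, 230],          # missing and implausible values
    "triage_ts": pd.to_datetime(["2015-01-01", "2015-01-02", "2015-01-02", None]),
})

report = {
    "missing_age_pct": ed["age"].isna().mean() * 100,
    "implausible_age_rows": int((~ed["age"].between(0, 120) & ed["age"].notna()).sum()),
    "duplicate_visit_ids": int(ed["visit_id"].duplicated().sum()),
    "missing_triage_ts": int(ed["triage_ts"].isna().sum()),
}
print(report)
```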
Article
Full-text available
Systematic reviews and meta-analyses are essential to summarize evidence relating to efficacy and safety of health care interventions accurately and reliably. The clarity and transparency of these reports, however, is not optimal. Poor reporting of systematic reviews diminishes their value to clinicians, policy makers, and other users. Since the development of the QUOROM (QUality Of Reporting Of Meta-analysis) Statement—a reporting guideline published in 1999—there have been several conceptual, methodological, and practical advances regarding the conduct and reporting of systematic reviews and meta-analyses. Also, reviews of published systematic reviews have found that key information about these studies is often poorly reported. Realizing these issues, an international group that included experienced authors and methodologists developed PRISMA (Preferred Reporting Items for Systematic reviews and Meta-Analyses) as an evolution of the original QUOROM guideline for systematic reviews and meta-analyses of evaluations of health care interventions. The PRISMA Statement consists of a 27-item checklist and a four-phase flow diagram. The checklist includes items deemed essential for transparent reporting of a systematic review. In this Explanation and Elaboration document, we explain the meaning and rationale for each checklist item. For each item, we include an example of good reporting and, where possible, references to relevant empirical studies and methodological literature. The PRISMA Statement, this document, and the associated Web site (http://www.prisma-statement.org/) should be helpful resources to improve reporting of systematic reviews and meta-analyses.
Article
Full-text available
Assessing the quality of the information proposed by an information system has become one of the major research topics in the last two decades. A quick literature survey shows that a significant number of information quality frameworks are proposed in different domains of application: management information systems, web information systems, information fusion systems, and so forth. Unfortunately, they do not provide a feasible methodology that is both simple and intuitive to be implemented in practice. In order to address this need, we present in this article a new information quality methodology. Our methodology makes use of existing frameworks and proposes a three-step process capable of tracking the quality changes through the system. In the first step and as a novelty compared to existing studies, we propose decomposing the information system into its elementary modules. Having access to each module allows us to locally define the information quality. Then, in the second step, we model each processing module by a quality transfer function, capturing the module’s influence over the information quality. In the third step, we make use of the previous two steps in order to estimate the quality of the entire information system. Thus, our methodology allows informing the end-user on both output quality and local quality. The proof of concept of our methodology has been carried out considering two applications: an automatic target recognition system and a diagnosis coding support system.
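The three-step methodology can be illustrated by attaching a quality transfer function to each module and composing them along the processing chain; the modules and functions below are toy examples under that assumption, not those of the systems evaluated in the article.

```python
# Sketch of the three-step idea: decompose the system into modules, attach a
# quality transfer function to each, then compose them to estimate end-to-end
# quality while keeping local (per-module) quality visible. Toy examples only.

def sensor_module(q: float) -> float:
    return 0.95 * q                 # this module slightly degrades input quality

def fusion_module(q: float) -> float:
    return min(1.0, q + 0.05)       # this module improves quality a little

def coding_module(q: float) -> float:
    return 0.90 * q

PIPELINE = [sensor_module, fusion_module, coding_module]

def system_quality(input_quality: float) -> list:
    """Return the local quality after each module; the last entry is the output quality."""
    trace, q = [], input_quality
    for module in PIPELINE:
        q = module(q)
        trace.append((module.__name__, round(q, 3)))
    return trace

print(system_quality(0.9))
```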
Article
Full-text available
Poor data quality can be a serious threat to the validity and generalizability of clinical research findings. The growing availability of electronic administrative and clinical data is accompanied by a growing concern about the quality of these data for observational research and other analytic purposes. Currently, there are no widely accepted guidelines for reporting quality results that would enable investigators and consumers to independently determine if a data source is fit for use to support analytic inferences and reliable evidence generation. We developed a conceptual model that captures the flow of data from data originator across successive data stewards and finally to the data consumer. This "data lifecycle" model illustrates how data quality issues can result in data being returned back to previous data custodians. We highlight the potential risks of poor data quality on clinical practice and research results. Because of the need to ensure transparent reporting of a data quality issues, we created a unifying data-quality reporting framework and a complementary set of 20 data-quality reporting recommendations for studies that use observational clinical and administrative data for secondary data analysis. We obtained stakeholder input on the perceived value of each recommendation by soliciting public comments via two face-to-face meetings of informatics and comparative-effectiveness investigators, through multiple public webinars targeted to the health services research community, and with an open access online wiki. Our recommendations propose reporting on both general and analysis-specific data quality features. The goals of these recommendations are to improve the reporting of data quality measures for studies that use observational clinical and administrative data, to ensure transparency and consistency in computing data quality measures, and to facilitate best practices and trust in the new clinical discoveries based on secondary use of observational data.
Article
Full-text available
The growing amount of data in operational electronic health record systems provides unprecedented opportunity for its reuse for many tasks, including comparative effectiveness research. However, there are many caveats to the use of such data. Electronic health record data from clinical settings may be inaccurate, incomplete, transformed in ways that undermine their meaning, unrecoverable for research, of unknown provenance, of insufficient granularity, and incompatible with research protocols. However, the quantity and real-world nature of these data provide impetus for their use, and we develop a list of caveats to inform would-be users of such data as well as provide an informatics roadmap that aims to insure this opportunity to augment comparative effectiveness research can be best leveraged.
Article
Full-text available
Objective To review the methods and dimensions of data quality assessment in the context of electronic health record (EHR) data reuse for research. Materials and methods A review of the clinical research literature discussing data quality assessment methodology for EHR data was performed. Using an iterative process, the aspects of data quality being measured were abstracted and categorized, as well as the methods of assessment used. Results Five dimensions of data quality were identified, which are completeness, correctness, concordance, plausibility, and currency, and seven broad categories of data quality assessment methods: comparison with gold standards, data element agreement, data source agreement, distribution comparison, validity checks, log review, and element presence. Discussion Examination of the methods by which clinical researchers have investigated the quality and suitability of EHR data for research shows that there are fundamental features of data quality, which may be difficult to measure, as well as proxy dimensions. Researchers interested in the reuse of EHR data for clinical research are recommended to consider the adoption of a consistent taxonomy of EHR data quality, to remain aware of the task-dependence of data quality, to integrate work on data quality assessment from other fields, and to adopt systematic, empirically driven, statistically based methods of data quality assessment. Conclusion There is currently little consistency or potential generalizability in the methods used to assess EHR data quality. If the reuse of EHR data for clinical research is to become accepted, researchers should adopt validated, systematic methods of EHR data quality assessment.
Article
Full-text available
Data in computer-based patient records (CPRs) have many uses beyond their primary role in patient care, including research and health-system management. Although the accuracy of CPR data directly affects these applications, there has been only sporadic interest in, and no previous review of, data accuracy in CPRs. This paper reviews the published studies of data accuracy in CPRs. These studies report highly variable levels of accuracy. This variability stems from differences in study design, in types of data studied, and in the CPRs themselves. These differences confound interpretation of this literature. We conclude that our knowledge of data accuracy in CPRs is not commensurate with its importance and further studies are needed. We propose methodological guidelines for studying accuracy that address shortcomings of the current literature. As CPR data are used increasingly for research, methods used in research databases to continuously monitor and improve accuracy should be applied to CPRs.
Article
Full-text available
The literature provides a wide range of techniques to assess and improve the quality of data. Due to the diversity and complexity of these techniques, research has recently focused on defining methodologies that help the selection, customization, and application of data quality assessment and improvement techniques. The goal of this article is to provide a systematic and comparative description of such methodologies. Methodologies are compared along several dimensions, including the methodological phases and steps, the strategies and techniques, the data quality dimensions, the types of data, and, finally, the types of information systems addressed by each methodology. The article concludes with a summary description of each methodology.
Article
Full-text available
The Federated Utah Research and Translational Health e-Repository (FURTHeR) is a Utah statewide informatics platform for the new Center for Clinical and Translational Science at the University of Utah. We have been working on one of FURTHeR's key components, a federated query engine for heterogeneous resources, which we believe has the potential to meet some of the fundamental needs of translational science to access and integrate diverse biomedical data and promote discovery of new knowledge. The architecture of the federated query engine for heterogeneous resources is described and demonstrated.
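The abstract does not detail FURTHeR's internals, but the general fan-out/merge pattern behind a federated query engine can be sketched as follows; the adapter class, query shape, and source names are hypothetical illustrations, not FURTHeR's actual API.

```python
# Illustrative sketch of the fan-out/merge pattern behind a federated query engine.
# The adapter class, query shape, and source names are hypothetical.
from concurrent.futures import ThreadPoolExecutor

class SourceAdapter:
    """Wraps one heterogeneous data source behind a common query interface."""
    def __init__(self, name, rows):
        self.name, self.rows = name, rows

    def query(self, predicate):
        # A real adapter would translate the logical query into the source's
        # native query language (SQL, REST, etc.); here we simply filter in memory.
        return [dict(r, source=self.name) for r in self.rows if predicate(r)]

def federated_query(adapters, predicate):
    """Run the same logical query against every source in parallel and merge results."""
    with ThreadPoolExecutor() as pool:
        result_sets = list(pool.map(lambda a: a.query(predicate), adapters))
    return [row for rows in result_sets for row in rows]

sources = [
    SourceAdapter("hospital_a", [{"patient_id": 1, "dx": "J45"}]),
    SourceAdapter("hospital_b", [{"patient_id": 9, "dx": "E11"}, {"patient_id": 4, "dx": "J45"}]),
]
asthma_cases = federated_query(sources, lambda r: r["dx"] == "J45")
print(asthma_cases)
```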
Article
Full-text available
In this article, we describe principles that can help organizations develop usable data quality metrics.
Article
The secondary use of EHR data for research is expected to improve health outcomes for patients, but the benefits will only be realized if the data in the EHR is of sufficient quality to support these uses. A data quality (DQ) ontology was developed to rigorously define concepts and enable automated computation of data quality measures. The healthcare data quality literature was mined for the important terms used to describe data quality concepts and harmonized into an ontology. Four high-level data quality dimensions ("correctness", "consistency", "completeness" and "currency") categorize 19 lower level measures. The ontology serves as an unambiguous vocabulary, which defines concepts more precisely than natural language; it provides a mechanism to automatically compute data quality measures; and is reusable across domains and use cases. A detailed example is presented to demonstrate its utility. The DQ ontology can make data validation more common and reproducible.
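A rough sketch of the core idea, binding ontology terms (dimension, measure) to executable checks so that data quality computation can be automated, might look like the following; the specific bindings and field names are assumptions for illustration and are not the 19 measures of the ontology itself.

```python
# Sketch of binding (dimension, measure) terms to executable checks so that data
# quality computation can be automated. The bindings and field names are assumptions.
from datetime import datetime, timedelta

def field_present(rows, field):
    return sum(r.get(field) is not None for r in rows) / len(rows)

def value_in_range(rows, field, lo, hi):
    vals = [r[field] for r in rows if r.get(field) is not None]
    return sum(lo <= v <= hi for v in vals) / len(vals) if vals else None

def recent_enough(rows, field, max_age_days):
    cutoff = datetime.now() - timedelta(days=max_age_days)
    vals = [r[field] for r in rows if r.get(field) is not None]
    return sum(v >= cutoff for v in vals) / len(vals) if vals else None

# Ontology-style registry: (dimension, measure) -> executable check.
MEASURES = {
    ("completeness", "element_presence"): field_present,
    ("correctness", "range_check"): value_in_range,
    ("currency", "recency_check"): recent_enough,
}

rows = [{"sbp": 128, "recorded": datetime(2024, 1, 3)},
        {"sbp": None, "recorded": datetime(2019, 6, 9)}]

print(MEASURES[("completeness", "element_presence")](rows, "sbp"))
print(MEASURES[("correctness", "range_check")](rows, "sbp", 60, 250))
print(MEASURES[("currency", "recency_check")](rows, "recorded", 3 * 365))
```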
Article
The provision of high quality data is of considerable importance to the health sector. Healthcare is a domain in which the timely provision of accurate, current and complete patient data is one of the most important objectives. The quality of Electronic Health Record (EHR) data concerns both health professionals and researchers who reuse the data for secondary purposes. To ensure high quality data in the health sector, health-related organisations need appropriate methodologies and measurement processes to assess and analyse the quality of their data. Yet, adequate attention has not been paid to the existing data quality problems (dirty data) in health-related research. In practice, anomaly detection and cleansing are time-consuming and labour-intensive, which makes them unrealistic for most health-related organisations. This paper proposes a dimension-oriented taxonomy of data quality problems. The data quality assessment mechanism relates business impacts to data quality dimensions. As a case study, the new taxonomy-based data quality assessment was used to assess the quality of data populating an EHR system in a large Saudi Arabian hospital. The assessment results were discussed and reviewed with the hospital's top management as well as with the assessment team that participated in the data quality assessment process. The assessment team then evaluated this new approach.
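One way such a dimension-oriented taxonomy could be made operational is sketched below: detected problem types are grouped by dimension and weighted by an assumed business impact. The taxonomy entries, impact weights, and problem counts are hypothetical.

```python
# Illustrative sketch of a dimension-oriented taxonomy of data quality problems,
# with each detected problem weighted by an assumed business impact.
TAXONOMY = {
    "completeness": ["missing_value", "missing_record"],
    "accuracy": ["out_of_range_value", "misspelled_code"],
    "consistency": ["contradictory_entries", "duplicate_record"],
}

BUSINESS_IMPACT = {  # 1 = minor nuisance, 5 = affects patient safety or billing
    "missing_value": 2, "missing_record": 4, "out_of_range_value": 3,
    "misspelled_code": 2, "contradictory_entries": 5, "duplicate_record": 3,
}

def dimension_scores(detected_problems):
    """Aggregate detected problem counts into an impact-weighted score per dimension."""
    scores = {}
    for dimension, problems in TAXONOMY.items():
        scores[dimension] = sum(
            detected_problems.get(p, 0) * BUSINESS_IMPACT[p] for p in problems)
    return scores

print(dimension_scores({"missing_value": 120, "duplicate_record": 8}))
```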
Article
Beyond the hype of Big Data, something within business intelligence projects is indeed changing. This is mainly because Big Data is not only about data, but also about a complete conceptual and technological stack, including raw and processed data, storage, ways of managing data, processing and analytics. A challenge that becomes even trickier is the management of data quality in Big Data environments. More than ever before, the need to assess Quality-in-Use gains importance, since the real contribution (business value) of data can only be estimated in its context of use. Although different Data Quality models exist for assessing the quality of regular data, none of them has been adapted to Big Data. To fill this gap, we propose the "3As Data Quality-in-Use model", which is composed of three Data Quality characteristics for assessing the levels of Data Quality-in-Use in Big Data projects: Contextual Adequacy, Operational Adequacy and Temporal Adequacy. The model can be integrated into any sort of Big Data project, as it is independent of any pre-conditions or technologies. The paper shows how to use the model with a working example. The model addresses the challenges of a Data Quality program aimed at Big Data. The main conclusion is that the model can be used as an appropriate way to obtain the Quality-in-Use levels of the input data of a Big Data analysis, and those levels can be understood as indicators of the trustworthiness and soundness of the results of that analysis.
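A minimal sketch of reporting the three adequacy characteristics is shown below; how each adequacy is actually scored is not specified by the abstract, so the scoring rules, field names, and thresholds here are placeholders for illustration only.

```python
# Placeholder scoring of the three adequacy characteristics of the "3As" model.
# The scoring rules, field names, and thresholds are illustrative assumptions.
from datetime import datetime, timedelta

def contextual_adequacy(dataset, required_fields):
    """Share of fields needed for this analysis context that the dataset provides."""
    return len(required_fields & dataset["fields"]) / len(required_fields)

def operational_adequacy(dataset, max_latency_s):
    """1.0 if the dataset can be accessed within the operational latency budget."""
    return 1.0 if dataset["access_latency_s"] <= max_latency_s else 0.0

def temporal_adequacy(dataset, max_age):
    """1.0 if the most recent refresh is recent enough for the analysis."""
    return 1.0 if datetime.now() - dataset["last_refresh"] <= max_age else 0.0

dataset = {"fields": {"patient_id", "dx", "admit_ts"},
           "access_latency_s": 4.2,
           "last_refresh": datetime.now() - timedelta(days=2)}

report = {
    "contextual": contextual_adequacy(dataset, {"patient_id", "dx", "lab_result"}),
    "operational": operational_adequacy(dataset, max_latency_s=10),
    "temporal": temporal_adequacy(dataset, max_age=timedelta(days=7)),
}
print(report)
```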
Conference Paper
Retrospective/observational clinical research studies depend on the secondary use of electronic health record (EHR) data to obtain important results about the effectiveness of different medical interventions. In contrast to traditional clinical trials, these studies provide results from real-world clinical settings, but suffer from data quality issues. Therefore, it is important to take into account the nature and quality of the data when designing these studies in order to differentiate between true and artifactual variations [1]. We are developing a service-oriented framework to assess the quality of EHR data.
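A generic sketch of what such a service-oriented data quality check could look like from a client's perspective follows; the DataQualityService class, its method names, and the check identifiers are hypothetical and do not represent the authors' actual framework.

```python
# Generic sketch of a data quality assessment exposed as a service call.
# Class name, method names, dataset identifiers and checks are hypothetical.
import json

class DataQualityService:
    """Facade a client could call (e.g., over HTTP) to run named checks on a dataset."""
    def __init__(self, datasets, checks):
        self.datasets = datasets      # dataset_id -> rows
        self.checks = checks          # check_id -> callable(rows) -> float

    def assess(self, dataset_id, check_ids):
        rows = self.datasets[dataset_id]
        return {"dataset": dataset_id,
                "results": {c: self.checks[c](rows) for c in check_ids}}

checks = {
    "dob_completeness": lambda rows: sum(r.get("dob") is not None for r in rows) / len(rows),
}
service = DataQualityService({"ehr_extract_2015": [{"dob": "1980-05-01"}, {"dob": None}]}, checks)
print(json.dumps(service.assess("ehr_extract_2015", ["dob_completeness"]), indent=2))
```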
Article
Microbiology study results are necessary for conducting many comparative effectiveness research studies. Unlike core laboratory test results, microbiology results have a complex structure. Federating and integrating microbiology data from six disparate electronic medical record systems is challenging and requires a team with varied skills. The PHIS+ consortium, a partnership between members of the Pediatric Research in Inpatient Settings (PRIS) network, the Children's Hospital Association and the University of Utah, has used FURTHeR for federating laboratory data. We present our process and initial results for federating microbiology data from six pediatric hospitals.
Article
Answers to clinical and public health research questions increasingly require aggregated data from multiple sites. Data from electronic health records and other clinical sources are useful for such studies, but require stringent quality assessment. Data quality assessment is particularly important in multisite studies to distinguish true variations in care from data quality problems. We propose a "fit-for-use" conceptual model for data quality assessment and a process model for planning and conducting single-site and multisite data quality assessments. These approaches are illustrated using examples from prior multisite studies. Critical components of multisite data quality assessment include: thoughtful prioritization of variables and data quality dimensions for assessment; development and use of standardized approaches to data quality assessment that can improve data utility over time; iterative cycles of assessment within and between sites; targeting assessment toward data domains known to be vulnerable to quality problems; and detailed documentation of the rationale and outcomes of data quality assessments to inform data users. The assessment process requires constant communication between site-level data providers, data coordinating centers, and principal investigators. A conceptually based and systematically executed approach to data quality assessment is essential to achieve the potential of the electronic revolution in health care. High-quality data allow "learning health care organizations" to analyze and act on their own information, to compare their outcomes to peers, and to address critical scientific questions from the population perspective.
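The sketch below illustrates one iteration of such a multisite process under assumed priorities: the same prioritized variables and dimensions are evaluated at every site and each outcome is documented for data users. The site names, variables, and thresholds are hypothetical.

```python
# One illustrative iteration of a multisite data quality assessment: shared
# priorities are checked at every site and the outcome is documented.
# Site names, variables, and thresholds are hypothetical.
from datetime import date

PRIORITIES = [  # (variable, dimension, minimum acceptable score)
    ("hemoglobin_a1c", "completeness", 0.90),
    ("discharge_date", "plausibility", 0.99),
]

def assess_site(site_name, scores):
    """Compare a site's reported scores against the shared priorities."""
    findings = []
    for variable, dimension, threshold in PRIORITIES:
        score = scores.get((variable, dimension))
        findings.append({
            "site": site_name, "variable": variable, "dimension": dimension,
            "score": score, "passes": score is not None and score >= threshold,
            "assessed_on": date.today().isoformat(),
        })
    return findings

site_scores = {
    "site_a": {("hemoglobin_a1c", "completeness"): 0.95, ("discharge_date", "plausibility"): 0.97},
    "site_b": {("hemoglobin_a1c", "completeness"): 0.82, ("discharge_date", "plausibility"): 1.00},
}
for site, scores in site_scores.items():
    for finding in assess_site(site, scores):
        print(finding)
```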
Article
The duplication of standard medical record format and content is not the intent of a computer-based medical record. Seven characteristics designed to improve patient care, efficiency of data entry, and research capability in a computerized system are enumerated. These characteristics are further amplified and demonstrated by reference to a system in use by the Biographics Department of the University of Utah. Portions of the system dealt with specifically are the file structure, patient screening, ICU monitoring, diagnosis entry and data retrieval.
Article
Poor data quality (DQ) can have substantial social and economic impacts. Although firms are improving data quality with practical approaches and tools, their improvement efforts tend to focus narrowly on accuracy. We believe that data consumers have a much broader data quality conceptualization than IS professionals realize. The purpose of this paper is to develop a framework that captures the aspects of data quality that are important to data consumers. A two-stage survey and a two-phase sorting study were conducted to develop a hierarchical framework for organizing data quality dimensions. This framework captures dimensions of data quality that are important to data consumers. Intrinsic DQ denotes that data have quality in their own right. Contextual DQ highlights the requirement that data quality must be considered within the context of the task at hand. Representational DQ and accessibility DQ emphasize the importance of the role of systems. These findings are consistent with our understanding that high-quality data should be intrinsically good, contextually appropriate for the task, clearly represented, and accessible to the data consumer. Our framework has been used effectively in industry and government. Using this framework, IS managers were able to better understand and meet their data consumers' data quality needs. The salient feature of this research study is that quality attributes of data are collected from data consumers instead of being defined theoretically or based on researchers' experience. Although exploratory, this research provides a basis for future studies that measure data quality along the dimensions of this framework.
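As an illustration only, the hierarchy could be represented as a simple lookup structure that a data consumer uses to find which category a dimension of interest falls under; the dimension names listed under each category are commonly cited examples and an assumption here, not drawn from the abstract itself.

```python
# Illustrative lookup structure for the four-category hierarchy. The dimension
# names under each category are commonly cited examples (an assumption here).
FRAMEWORK = {
    "intrinsic": ["accuracy", "objectivity", "believability", "reputation"],
    "contextual": ["relevancy", "timeliness", "completeness",
                   "appropriate amount of data", "value-added"],
    "representational": ["interpretability", "ease of understanding",
                         "representational consistency", "concise representation"],
    "accessibility": ["accessibility", "access security"],
}

def category_of(dimension):
    """Return the category a given dimension belongs to, or None if unknown."""
    for category, dimensions in FRAMEWORK.items():
        if dimension in dimensions:
            return category
    return None

print(category_of("timeliness"))       # -> contextual
print(category_of("access security"))  # -> accessibility
```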
Article
This paper describes the historical development of the ER model from the 1970s to recent years. It starts with a discussion of the motivations and the environmental factors in the early days. Then, the paper points out the role of the ER model in the Computer-Aided Software Engineering (CASE) movement in the late 1980s and early 1990s. It also describes the possible role of the author's Chinese cultural heritage in the development of the ER model. In that context, the relationships between natural languages (including Ancient Egyptian hieroglyphs) and ER concepts are explored. Finally, the lessons learned and future directions are presented.
Gouripeddi R, Schultz DN, Bradshaw RL, Mo P, Butcher R, Madsen RK, Warner PB, Lasalle B, Facelli JC. FURTHeR: an infrastructure for clinical, translational and comparative effectiveness research.
Gouripeddi. An informatics architecture for an exposome.
Sundar Rajan. Measuring validity of phenotyping algorithms across disparate data using a data quality assessment framework.
ISO. Information and documentation - the Dublin Core metadata element set.
Bradshaw. Going FURTHeR with the metadata repository.