Chapter

Comparative Analysis of Sampling Methods for Data Quality Assessment

Article
Big Data has recently become an important factor in business, requiring strategies to manage large volumes of structured, unstructured, and semi-structured data. Analyzing data at such scale to extract meaning and handle uncertain outcomes is challenging. Almost all big data sets are dirty, i.e. they may contain inaccuracies, missing data, miscoding, and other issues that weaken big data analytics. One of the biggest challenges in big data analytics is to discover and repair dirty data; failure to do so can lead to inaccurate analytics and unreliable conclusions. Data cleaning is an essential part of managing and analyzing data. This survey paper examines the data quality problems that may occur in big data processing, to clarify why an organization requires data cleaning, followed by data quality criteria (dimensions used to indicate data quality). Cleaning tools available on the market are then summarized, and the challenges of cleaning big data that arise from the nature of the data are discussed. Machine learning algorithms can be used to analyze data, make predictions, and ultimately clean data automatically. © 2018 Institute of Advanced Engineering and Science. All rights reserved.
Article
Data quality issues trace their origin back to the early days of computing. A wide range of domain-specific techniques to assess and improve the quality of data exist in the literature. These solutions primarily target data that resides in relational databases and data warehouses. The recent emergence of big data analytics and the renaissance in machine learning necessitate evaluating the suitability of relational database-centric approaches to data quality. In this paper, we describe the nature of data quality issues in the context of big data and machine learning. We discuss facets of data quality, present a data governance-driven framework for the data quality lifecycle in this new scenario, and describe an approach to its implementation. A sampling of the tools available for data quality management is indicated and future trends are discussed.
Article
Evaluation of research artefacts (such as models, frameworks and methodologies) is essential to determine their quality and demonstrate worth. However, in the IQ research domain there is no existing standard set of criteria available for researchers to use to evaluate their IQ artefacts. This paper therefore describes our experience of selecting and synthesizing a set of evaluation criteria used in three related research areas of Information Systems (IS), Software Products (SP) and Conceptual Models (CM), and analysing their relevance to different types of IQ research artefact. We selected and used a subset of these criteria in an actual evaluation of an IQ artefact to test whether they provide any benefit over a standard evaluation. The results show that at least a subset of the criteria from the other domains of IS, SP and CM are relevant for IQ artefact evaluations, and the resulting set of criteria, most importantly, enabled a more rigorous and systematic selection of what to evaluate.
Conference Paper
Data is part of our everyday life and an essential asset in numerous businesses and organizations. The quality of the data, i.e., the degree to which the data characteristics fulfill requirements, can have a tremendous impact on the businesses themselves, the companies, or even human lives. In fact, research and industry reports show that huge amounts of capital are spent to improve the quality of the data being used in many systems, sometimes even only to understand the quality of the information in use. Considering the variety of dimensions, characteristics, business views, or simply the specificities of the systems being evaluated, understanding how to measure data quality can be an extremely difficult task. In this paper we survey the state of the art in the classification of poor data, including the definition of dimensions and specific data problems; we identify frequently used dimensions and map data quality problems to the identified dimensions. The huge variety of terms and definitions found suggests that further standardization efforts are required. Also, data quality research on Big Data appears to be in its initial steps, leaving open space for further research.
Article
Data quality (DQ) assessment and improvement in larger information systems would often not be feasible without using suitable “DQ methods”, which are algorithms that can be automatically executed by computer systems to detect and/or correct problems in datasets. Currently, these methods are already essential, and they will be of even greater importance as the quantity of data in organisational systems grows. This paper provides a review of existing methods for both DQ assessment and improvement and classifies them according to the DQ problem and problem context. Six gaps have been identified in the classification, where no current DQ methods exist, and these show where new methods are required as a guide for future research and DQ tool development.
Article
High quality data and effective data quality assessment are required for accurately evaluating the impact of public health interventions and measuring public health outcomes. Data, data use, and the data collection process, as the three dimensions of data quality, all need to be assessed for overall data quality assessment. We reviewed current data quality assessment methods. Relevant studies were identified in major databases and on well-known institutional websites. We found that the data dimension was assessed most frequently. Completeness, accuracy, and timeliness were the three most-used attributes among a total of 49 attributes of data quality. The major quantitative assessment methods were descriptive surveys and data audits, whereas the common qualitative assessment methods were interviews and documentation review. The limitations of the reviewed studies included inattentiveness to data use and the data collection process, inconsistency in the definitions of data quality attributes, failure to address data users' concerns, and a lack of systematic procedures in data quality assessment. This review is limited by the coverage of the databases and the breadth of public health information systems. Further research could develop consistent data quality definitions and attributes, and more effort should go to assessing the quality of data use and of the data collection process.
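Completeness, the most frequently used attribute above, is also the simplest to operationalize: the fraction of records with a non-missing value for a given field. A minimal sketch, with illustrative records and field names (not taken from the study):

```python
# Hypothetical health-indicator records; None marks a missing value.
records = [
    {"height": 172, "weight": 70,   "smoking": "never"},
    {"height": None, "weight": 65,  "smoking": None},
    {"height": 180, "weight": None, "smoking": None},
]

def completeness(rows, field):
    """Fraction of rows with a non-missing value for `field`."""
    filled = sum(1 for r in rows if r.get(field) is not None)
    return filled / len(rows)

for field in ("height", "weight", "smoking"):
    print(field, round(completeness(records, field), 2))
# height and weight each score 0.67; smoking scores 0.33
```

Accuracy and timeliness require a reference value or a timestamp to compare against, so they are harder to compute generically; completeness needs only the data itself.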
Article
The unprecedented volume of information on the Web makes it difficult for information consumers to assess the quality of that information. The wide variety of web users and the distinct situations at hand complicate the quality assessment (QA) process, which must be customizable to the information consumer's needs. To that end, we (1) introduce the Web Quality Assessment model, formalizing the QA process driven by customizable sets of QA policies, and (2) propose a novel concept of QA social networks, which improve the QA process by increasing the number of relevant QA policies successively applied to the consumed resources, sharing QA policies between users who trust each other. Since the ability to assess information quality will play a fundamental role in the continued evolution of the Web, we are convinced that the concept of QA social networks can contribute in this area.
Article
In this article, we describe principles that can help organizations develop usable data quality metrics.
Article
Business rules are an effective way to control data quality. Business experts can enter the rules directly into appropriate software without error-prone communication with programmers. However, not all business situations and possible data quality problems can be anticipated. In situations where business rules have not yet been defined, patterns of data handling may arise in practice. We apply data mining to accounting transactions in order to discover such patterns. The discovered patterns are represented as association rules. Deviations from the discovered patterns can then be marked as potential data quality violations that need to be examined by humans. Data quality breaches can be expensive, but manual examination of many transactions is also expensive. Therefore, the goal is to find a balance between marking too many and too few transactions as potentially erroneous. We apply appropriate procedures to evaluate the classification accuracy of the developed association rules and support the decision on the number of deviations to be manually examined based on economic principles.
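The mechanism described — mine dominant patterns from transactions, then flag deviations for review — can be sketched with a toy rule miner. The transactions, the 70% confidence threshold, and the one-consequent rule form are illustrative assumptions, not the paper's actual procedure:

```python
from collections import Counter

# Hypothetical accounting transactions: (debit_account, credit_account).
transactions = [
    ("inventory", "payables"), ("inventory", "payables"),
    ("inventory", "payables"), ("inventory", "cash"),
    ("travel", "cash"), ("travel", "cash"),
]

def discover_rules(txns, min_confidence=0.7):
    """For each debit account, keep the credit account used in at least
    `min_confidence` of its transactions as an association rule."""
    by_debit = Counter(t[0] for t in txns)
    pair_counts = Counter(txns)
    rules = {}
    for (debit, credit), n in pair_counts.items():
        if n / by_debit[debit] >= min_confidence:
            rules[debit] = credit
    return rules

rules = discover_rules(transactions)
# Deviations from a discovered rule become candidates for manual review.
flagged = [t for t in transactions if t[0] in rules and t[1] != rules[t[0]]]
print(rules)    # inventory -> payables, travel -> cash
print(flagged)  # the lone ("inventory", "cash") posting is flagged
```

Raising `min_confidence` flags fewer transactions; lowering it flags more — exactly the economic trade-off the paper addresses.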
Article
Today’s supply chain professionals are inundated with data, motivating new ways of thinking about how data are produced, organized, and analyzed. This has provided an impetus for organizations to adopt and perfect data analytic functions (e.g. data science, predictive analytics, and big data) in order to enhance supply chain processes and, ultimately, performance. However, management decisions informed by the use of these data analytic methods are only as good as the data on which they are based. In this paper, we introduce the data quality problem in the context of supply chain management (SCM) and propose methods for monitoring and controlling data quality. In addition to advocating for the importance of addressing data quality in supply chain research and practice, we also highlight interdisciplinary research topics based on complementary theory.
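One standard way to "monitor and control" a quality characteristic, as the paper proposes for data quality, is statistical process control. A hedged sketch using a p-chart upper control limit on daily error proportions; the rates and sample size are synthetic, not from the paper:

```python
# Synthetic daily proportions of erroneous records in a data feed.
daily_error_rates = [0.02, 0.03, 0.02, 0.04, 0.02, 0.09, 0.03]
n = 500  # records inspected per day (assumed constant)

p_bar = sum(daily_error_rates) / len(daily_error_rates)   # center line
sigma = (p_bar * (1 - p_bar) / n) ** 0.5                   # binomial std. error
ucl = p_bar + 3 * sigma                                    # p-chart upper limit

# Days whose error rate exceeds the limit signal a process problem.
out_of_control = [(day, p) for day, p in enumerate(daily_error_rates, 1)
                  if p > ucl]
print(out_of_control)  # day 6 (rate 0.09) breaches the limit
```

A breach does not say *which* records are wrong; it says the error-generating process has shifted and warrants investigation.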
Article
Many previous studies of data quality have focused on the realization and evaluation of both data value quality and data service quality. These studies revealed that poor data value quality and poor data service quality were caused by poor data structure. In this study we focus on metadata management, namely, data structure quality and introduce the data quality management maturity model as a preferred maturity model. We empirically show that data quality improves as data management matures.
Article
Missing data are a substantial problem in clinical databases. This paper aims to examine patterns of missing data in a primary care database, compare this to nationally representative datasets and explore the use of multiple imputation (MI) for these data. The patterns and extent of missing health indicators in a UK primary care database (THIN) were quantified using 488 384 patients aged 16 or over in their first year after registration with a GP from 354 General Practices. MI models were developed and the resulting data compared to that from nationally representative datasets (14 142 participants aged 16 or over from the Health Survey for England 2006 (HSE) and 4 252 men from the British Regional Heart Study (BRHS)). Between 22% (smoking) and 38% (height) of health indicator data were missing in newly registered patients, 2004-2006. Distributions of height, weight and blood pressure were comparable to HSE and BRHS, but alcohol and smoking were not. After MI the percentage of smokers and non-drinkers was higher in THIN than the comparison datasets, while the percentage of ex-smokers and heavy drinkers was lower. Height, weight and blood pressure remained similar to the comparison datasets. Given available data, the results are consistent with smoking and alcohol data missing not at random whereas height, weight and blood pressure missing at random. Further research is required on suitable imputation methods for smoking and alcohol in such databases.
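The mechanics of multiple imputation — complete the dataset several times, analyze each copy, pool the estimates — can be illustrated in miniature. Real MI (e.g. chained equations, as typically used with such databases) draws each value conditional on other variables; this toy version draws unconditionally from the observed values, purely to show the m-datasets-then-pool structure. All numbers are invented:

```python
import random
import statistics

random.seed(0)
heights = [172, None, 180, 168, None, 175]          # None = missing
observed = [h for h in heights if h is not None]
m = 20                                              # number of imputations

estimates = []
for _ in range(m):
    # Complete the dataset by drawing each missing value from the
    # observed distribution (a crude stand-in for a proper imputation model).
    completed = [h if h is not None else random.choice(observed)
                 for h in heights]
    estimates.append(statistics.mean(completed))

pooled = statistics.mean(estimates)  # pooled point estimate across the m copies
print(round(pooled, 1))
```

This unconditional draw is only defensible when data are missing at random; the paper's point is precisely that smoking and alcohol appear *not* missing at random, so a naive model like this would be biased for them.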
Article
Often, in the real world, entities have two or more representations in databases. Duplicate records do not share a common key and/or contain errors that make duplicate matching difficult. Errors are introduced as the result of transcription errors, incomplete information, lack of standard formats, or any combination of these factors. In this paper, we present a thorough analysis of the literature on duplicate record detection. We cover similarity metrics that are commonly used to detect similar field entries, and we present an extensive set of duplicate detection algorithms that can detect approximately duplicate records in a database. We also cover multiple techniques for improving the efficiency and scalability of approximate duplicate detection algorithms. We conclude with coverage of existing tools and a brief discussion of the major open problems in the area.
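A minimal example of the field-level similarity metrics the survey covers: character-bigram Jaccard similarity, one common family alongside edit distance. The records and the 0.6 threshold are illustrative assumptions:

```python
def bigrams(s):
    """Set of adjacent character pairs of a lowercased string."""
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}

def jaccard(a, b):
    """Jaccard similarity of the two strings' bigram sets, in [0, 1]."""
    ba, bb = bigrams(a), bigrams(b)
    return len(ba & bb) / len(ba | bb)

records = ["John Smith", "Jon Smith", "Jane Doe"]
threshold = 0.6

# All-pairs comparison; real systems use blocking/indexing to avoid O(n^2).
pairs = [(a, b) for i, a in enumerate(records)
         for b in records[i + 1:] if jaccard(a, b) >= threshold]
print(pairs)  # the two Smith variants are matched as likely duplicates
```

The quadratic all-pairs loop is exactly what the scalability techniques surveyed (blocking, sorted neighborhoods) are designed to avoid on large databases.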
How to Create a Business Case for Data Quality Improvement
  • S. Moore
Approximate quality assessment with sampling approaches
  • H. Liu
  • Z. Sang
  • Larali
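The sampling-based assessment that the last entry (and this chapter's title) refers to rests on a simple idea: estimate a dataset's error rate from a random sample instead of inspecting every record. A hedged sketch with synthetic data and a normal-approximation confidence interval; nothing here is taken from the cited work:

```python
import math
import random

random.seed(1)
# Synthetic population of 100,000 records; True marks an erroneous record
# (true error rate 10%). In practice the labels come from manual inspection.
population = [random.random() < 0.10 for _ in range(100_000)]

n = 1000
sample = random.sample(population, n)       # simple random sample
p_hat = sum(sample) / n                      # estimated error rate
margin = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)  # approx. 95% CI half-width
print(f"estimated error rate: {p_hat:.3f} +/- {margin:.3f}")
```

Inspecting 1,000 records instead of 100,000 bounds the error rate to within about two percentage points — the cost/precision trade-off that motivates comparing sampling methods for quality assessment.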