978-1-4673-1090-1/12/$31.00 ©2012 IEEE 300
Data Quality: A Survey of Data Quality Dimensions
Fatimah Sidi,
Payam Hassany Shariat Panahy,
Lilly Suriani Affendey,
Marzanah A. Jabar,
Hamidah Ibrahim,
Aida Mustapha
Faculty of Computer Science and Information Technology
University Putra Malaysia
{fatimacd, suriani, marzanah, hamidah, aida} @
Abstract— Nowadays, activities and decision making in an organization are based on data and information obtained from data analysis, which provides various services for constructing reliable and accurate processes. As data are a significant resource in all organizations, the quality of data is critical for managers and operating processes to identify related performance issues. Moreover, high-quality data can increase the opportunity for achieving top services in an organization. However, identifying the various aspects of data quality, from definitions, dimensions and types to strategies and techniques, is essential to equip methods and processes for improving data. This paper presents a systematic review of data quality dimensions for use in a proposed framework that combines data mining and statistical techniques to measure dependencies among dimensions and to illustrate how extracting knowledge can increase process quality.

Keywords- Data Quality, Data Quality Dimensions, Types of Data
In order to support an organization's activities, the process activity must be designed appropriately, since it involves data. Data is the primary foundation of operational, tactical and decision-making activities. As data are crucial resources in all organizations and in business and governmental applications, the quality of data is critical for managers and operating processes to identify related performance issues [1], [2], [3]. There is a variety of data quality issues, from definition, measurement and analysis to improvement, that are essential for ensuring high data quality [4]. As various studies show, if the quality of processes as well as of information inputs is not controlled, the degradation of data quality will become obvious after a while [2].
To improve process quality with enhanced efficiency in production and administration, process design is necessary for automation and management technology; it is well known that both deliver services to business and individual users quickly and consistently [5]. Despite the availability of a large variety of techniques for assessing and improving data quality, such as business rules, record linkage and similarity measures, data quality methodologies have been defined and provided because of the rising difficulty and multiplicity of using these systems [2].
Data quality can provide various services for an organization, and nowadays high-quality data can increase the opportunity to achieve top services in an organization. Likewise, the effects of a lack of data quality can be multiplied in a Cooperative Information System (CIS). A CIS is an information system with the capability to distribute and share general objectives between different interconnected systems belonging to various independent organizations in different geographical areas, with data as its basic resource [6].
Some researchers have identified and tested factors affecting data quality inside an organization by collecting data from surveys and interviews with senior managers. Their results show that management responsibilities, such as a commitment to continually improving data quality, effective communication among stakeholders, and an understanding of data quality, are significant elements influencing data quality in an organization [1].
The rest of this paper discusses data quality strategies and techniques, types of data, data quality definitions, the classification of data quality problems, and data quality dimensions, in order to provide the fundamental issues in this field.
There are two types of strategies that can be adopted for improving data quality, namely data-driven and process-driven, and each strategy employs various techniques [2]. However, improving the quality of data is the aim of each strategy.
A. Data-Driven
Data-driven is a strategy for improving the quality of data by modifying data values directly. Related data-driven improvement techniques include: acquisition of new data, standardization or normalization, error localization and correction, record linkage, data and schema integration, source trustworthiness, and cost optimization [2].
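As an illustration of two of these data-driven techniques, the sketch below shows standardization of inconsistent date values and a string-similarity test for record linkage. It is a minimal example: the date formats, names and threshold are illustrative assumptions, not taken from the paper.

```python
from datetime import datetime
from difflib import SequenceMatcher

def standardize_date(value):
    """Standardization: normalize several common date formats to ISO 8601."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y", "%d %b %Y"):
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None  # unparseable value: flag it for error localization and correction

def similar(a, b, threshold=0.85):
    """Record linkage: treat two strings as a match above a similarity threshold."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

print(standardize_date("21/03/2012"))      # -> 2012-03-21
print(similar("Jon Smith", "John Smith"))  # -> True: likely the same person
```

In practice the similarity threshold is tuned per field; a stricter measure (e.g., edit distance on normalized values) is common for identifiers.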
B. Process-Driven
Process-driven is another strategy, which redesigns the process that produces or modifies data in order to improve its quality. The process-driven strategy consists of two main techniques: process control and process redesign. In process control, data are checked and managed during the manufacturing process, while in process redesign the causes of low quality are eliminated and new process steps are added in order to produce high quality. Furthermore, adding an activity that controls the format of data before storage is another feature of process redesign [2].
However, the advantage of process-driven techniques is that they perform better than data-driven techniques in the long term, because they completely remove the root causes of quality problems. In contrast, data-driven techniques are more expensive than process-driven ones in the long term, but they are efficient in the short term [2].
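The process-redesign idea of controlling the format of data before storage can be sketched as a validation gate. The field names and rules below are illustrative assumptions, not from the paper:

```python
import re

# Illustrative validation rules, applied before a record is allowed into storage.
RULES = {
    "age": lambda v: isinstance(v, int) and 0 <= v <= 130,
    "sex": lambda v: v in {"M", "F"},
    "email": lambda v: re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v) is not None,
}

def validate(record):
    """Return the fields that violate a rule; an empty list means the record is storable."""
    return [field for field, ok in RULES.items()
            if field in record and not ok(record[field])]

good = {"age": 34, "sex": "F", "email": "user@example.org"}
bad = {"age": 200, "sex": "F", "email": "not-an-email"}
print(validate(good))  # -> []
print(validate(bad))   # -> ['age', 'email']
```

Because the check runs before storage, it removes a root cause of low quality instead of correcting values afterwards, which is what distinguishes this strategy from the data-driven one.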
Data are representations of real-world objects that can be stored, retrieved and elaborated by a software process, and communicated through a network [7]. Researchers have provided different classifications of data in different areas. Implicitly or explicitly, three types of data are described in the field of DQ [2]. Table I presents the types of data based on this classification.
A second classification of data is based on considering data as a product; this model classifies data into three types. Table II shows this classification.
TABLE I. TYPES OF DATA

- Structured data: a generalization or aggregation of items described by elementary attributes defined within a domain (e.g., relational tables, statistical data).
- Unstructured data: a generic sequence of symbols, typically coded in natural language (e.g., the body of an email with free text).
- Semi-structured data: data that have a structure with some degree of flexibility (e.g., markup languages such as XML).
TABLE II. DATA AS A PRODUCT

- Raw data items: smaller data units which are used to create information and component data items.
- Component data items: data constructed from raw data items and stored temporarily until the final product is manufactured.
- Information products: data which are the consequence of performing a manufacturing activity on data.
Another classification of data is based on how strictly data quality can be measured and achieved, and has two classes: elementary data and aggregated data. In an organization, data which are managed by operational processes and represent atomic phenomena of the real world are called elementary data (e.g., sex, age), while data which are collected from elementary data by applying an aggregation function are called aggregated data (e.g., the average income that taxpayers paid in a specific city) [7]. From another point of view, data can be classified into different types based on their usage in a variety of fields (e.g., networks or the web).
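The elementary/aggregated distinction can be made concrete with a small sketch: the individual taxpayer records are elementary data, and applying an aggregation function over them yields aggregated data such as the average income per city. The city names and income figures are invented for illustration:

```python
from collections import defaultdict

# Elementary data: atomic real-world facts managed by operational processes.
taxpayers = [
    {"city": "Serdang", "income": 30000},
    {"city": "Serdang", "income": 50000},
    {"city": "Kajang",  "income": 40000},
]

def average_income_by_city(records):
    """Aggregated data: an aggregation function applied over elementary data."""
    totals = defaultdict(lambda: [0, 0])  # city -> [sum of incomes, record count]
    for r in records:
        totals[r["city"]][0] += r["income"]
        totals[r["city"]][1] += 1
    return {city: s / n for city, (s, n) in totals.items()}

print(average_income_by_city(taxpayers))  # -> {'Serdang': 40000.0, 'Kajang': 40000.0}
```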
Data quality has different definitions in different fields and periods, and researchers and experts have developed different understandings of it. According to quality management, data quality is the appropriateness of data for use, or the ability of data to meet user or customer needs [9]. Another common definition of data quality is "fitness for use". Indeed, the quality of data is critical for improving process activities, and it has been addressed in different fields including management, medicine, statistics and computer science. A widespread collection of definitions of data quality may provide the opportunity to better understand the nature of the data process.
Data quality problems can generally be divided into two classes: single-source and multi-source problems. According to previous research, four categories of data quality problems can be identified, as shown in Table III. The goal of classifying data quality problems is to illustrate non-standard data and to identify the exact application of data for the corresponding requirements [10].
TABLE III. CLASSIFICATION OF DATA QUALITY PROBLEMS

Single-source problems:
- Schema level: lack of integrity constraints, poor schema design (e.g., uniqueness constraints, referential integrity).
- Instance level: data entry errors, redundancy/duplicates, contradictory values.

Multi-source problems:
- Schema level: heterogeneous data models and schema designs (e.g., naming conflicts).
- Instance level: overlapping, contradicting and inconsistent data (e.g., inconsistent aggregating, inconsistent timing).
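Two of the single-source, instance-level problems can be detected mechanically. The sketch below, using an invented customer table with hypothetical field names, flags duplicate key values and contradictory values for the same key:

```python
def find_duplicates(records, key):
    """Return key values that occur in more than one record (redundancy/duplicates)."""
    seen, dups = set(), set()
    for r in records:
        k = r[key]
        (dups if k in seen else seen).add(k)
    return dups

def find_contradictions(records, key, field):
    """Return keys whose records disagree on `field` (contradictory values)."""
    values, conflicts = {}, set()
    for r in records:
        k = r[key]
        if k in values and values[k] != r[field]:
            conflicts.add(k)
        values.setdefault(k, r[field])
    return conflicts

customers = [
    {"id": 1, "city": "Serdang"},
    {"id": 2, "city": "Kajang"},
    {"id": 1, "city": "Kajang"},  # same id, different city: a contradiction
]
print(find_duplicates(customers, "id"))              # -> {1}
print(find_contradictions(customers, "id", "city"))  # -> {1}
```

Schema-level problems (missing constraints, naming conflicts) are not detectable from instances alone; they require inspecting the schema designs themselves.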
Table IV illustrates some data quality dimensions and their definitions from the literature. From the research perspective, there are various numbers of dimensions for information quality and data quality. In fact, data quality, information systems, and accounting and auditing are three initial categories for identifying proper DQ dimensions [11]. In the field of data quality, Wang [11] determined four categories, namely intrinsic DQ, accessibility DQ, contextual DQ and representational DQ, and fifteen dimensions for DQ/IQ (e.g., objectivity, believability, reputation, value added). Other researchers recognized extra DQ dimensions such as data validation, credibility, traceability and availability. In the area of information systems, researchers identified different factors such as reliability, precision, relevancy, usability and independency. In accounting and auditing, researchers explained that accuracy, timeliness and relevance are three data quality dimensions. In addition, in this area some scholars explained that internal control systems need the lowest cost and highest reliability, which refers to dimensions such as accuracy, frequency and size of data [12].

Based on the ISO standard, quality means the totality of the characteristics of an entity that bear on its ability to satisfy stated and implied needs [13].

A data quality dimension is a characteristic or part of information used for classifying information and data requirements. In fact, it offers a way of measuring and managing data quality as well as information [14].

Understanding data quality dimensions is therefore the primary step towards improving data quality. Analysts and developers use dimensions and taxonomies of separate data, via data quality tools, for creating and manipulating information in order to improve the information and its process.
TABLE IV. DATA QUALITY DIMENSIONS AND THEIR DEFINITIONS FROM THE LITERATURE

Timeliness: The extent to which the age of the data is appropriate for the task at hand [15]. Timeliness refers only to the delay between a change of a real-world state and the resulting modification of the information system state [2]. Timeliness has two components: age and volatility. Age or currency is a measure of how old the information is, based on how long ago it was recorded; volatility is a measure of information instability, i.e. the frequency of change of the value for an entity attribute [2, 16]. Also, the extent to which data are able to quickly meet the information needs for the task at hand [15].

Currency: The degree to which a datum is up-to-date; a datum value is up-to-date if it is correct in spite of possible discrepancies caused by time-related changes to the correct value [2]. Currency describes when the information was entered in the sources and/or the data warehouse, while volatility describes the time period for which information is valid in the real world [2, 18].

Consistency: The extent to which data is presented in the same format and is compatible with previous data. Consistency also refers to the violation of semantic rules defined over the set of data [2].

Accuracy: Data are accurate when data values stored in the database correspond to real-world values [2, 19]. The extent to which data is correct, reliable and certified [15]. Accuracy is a measure of the proximity of a data value, v, to some other value, v', that is considered correct [2, 17]. A measure of the correctness of the data, which requires an authoritative source of reference to be identified and accessible [14].

Completeness: The ability of an information system to represent every meaningful state of the represented real-world system [2, 11]. The extent to which data are of sufficient breadth, depth and scope for the task at hand [15]. The degree to which values are present in a data collection [2, 17]. The percentage of the real-world information entered in the sources and/or the data warehouse [2, 18]. Information having all required parts of an entity's information present [2, 16]. The ratio between the number of non-null values in a source and the size of the universal relation [2]. All values that are supposed to be collected, as per a collection theory [2, 21].

Accessibility: The extent to which information is available, or easily and quickly retrievable [15]. The extent to which information is physically accessible [23].

Duplication: A measure of unwanted duplication existing within or across systems for a particular field, record, or data set [14].

Data specifications: A measure of the existence, completeness, quality and documentation of data standards, data models, business rules, metadata and reference data [14].

Presentation quality: A measure of how information is presented to and collected from those who utilize it; format and appearance support the appropriate use of information [14].

Representational consistency: The extent to which data is presented in the same format [22].

Reputation: The extent to which information is highly regarded in terms of its source or content [15].

Safety: The capability of the function to achieve acceptable levels of risk of harm to people, process, property or the environment [13].

Appropriate amount of data: The extent to which the volume of data is appropriate for the task at hand [22]. The extent to which the quantity or volume of available data is appropriate [15].

Access security: The extent to which access to information is restricted appropriately to maintain its security [15].

Believability: The extent to which information is regarded as true and credible [15].

Understandability: The extent to which data are clear, without ambiguity, and easily comprehended [15]. The extent to which data is easily comprehended [22].

Objectivity: The extent to which information is unbiased, unprejudiced and impartial [15].

Relevancy: The extent to which information is applicable and helpful for the task at hand [15].

Effectiveness: The capability of the function to enable users to achieve specified goals with accuracy and completeness in a specified context of use [13].

Interpretability: The extent to which data is in appropriate languages, symbols and units, and the definitions are clear [22].

Ease of manipulation: The extent to which data is easy to manipulate and apply to different tasks [22].

Free-of-error: The extent to which data is correct and reliable [22].

Ease of use and maintainability: A measure of the degree to which data can be accessed and used, and the degree to which data can be updated, maintained, and managed [14].

Usability: The extent to which information is clear and easily used [23].

Reliability: The extent to which information is correct and reliable [15]. The capability of the function to maintain a specified level of performance when used under specified conditions [13].

Freshness: Freshness represents a family of quality factors, each one representing some freshness aspect and having its own metrics [24].

Value added: The extent to which information is beneficial and provides advantages from its use [15].

Learnability: The capability of the function to enable the user to learn it [13].

Data decay: A measure of the rate of negative change to data [14].

Concise representation: The extent to which information is compactly represented without being overwhelming (i.e., brief in presentation, yet complete and to the point) [15].

Consistency and synchronization: A measure of the equivalence of information used in various data stores, applications, and systems, and the processes for making data equivalent [14].

Data integrity: A measure of the existence, validity, structure, content, and other basic characteristics of the data [14].

Navigation: The extent to which data are easily found and linked to [23].

Data coverage: A measure of the availability and comprehensiveness of data compared to the total data universe or population of interest [14].

Transactability: A measure of the degree to which data will produce the desired business transaction or outcome [14].

Timeliness and availability: A measure of the degree to which data are current and available for use as specified and in the time frame in which they are expected [14].
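Several of these definitions translate directly into simple metrics. As a minimal sketch with invented toy data, completeness can be measured as the ratio of non-null values to expected values [2], and accuracy as the proximity of a value v to a reference value v' [2, 17]; the tolerance-based accuracy score below is one simple choice of proximity measure, not the only one:

```python
def completeness(rows, fields):
    """Ratio of non-null values to the total number of expected values."""
    total = len(rows) * len(fields)
    non_null = sum(1 for r in rows for f in fields if r.get(f) is not None)
    return non_null / total if total else 1.0

def accuracy(v, v_ref, tolerance):
    """Proximity-based accuracy: 1.0 if v is within tolerance of v', else 0.0."""
    return 1.0 if abs(v - v_ref) <= tolerance else 0.0

rows = [
    {"age": 34, "sex": "F"},
    {"age": None, "sex": "M"},  # one missing value out of four cells
]
print(completeness(rows, ["age", "sex"]))    # -> 0.75
print(accuracy(99.8, 100.0, tolerance=0.5))  # -> 1.0
```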
Most people think that the quality of data depends only on its accuracy, and they do not consider and analyze other significant dimensions for achieving higher quality. Indeed, data quality involves more than a single dimension, so the issue of dependencies among dimensions is essential for improving process quality in different domains and applications. Nevertheless, without knowing the existing relations between data quality dimensions, knowledge discovery cannot be effective and comprehensive for the decision-making process. Previous work found that not only can dimensions be strongly related to each other, but data quality can also be supported via these effective dependencies [25]. In fact, selecting appropriate dimensions and identifying the correlations among them can create high-quality data. In order to discover dependencies among the more commonly referenced dimensions, consisting of accuracy, currency, consistency and completeness, we propose a framework which combines data mining and statistical techniques to measure dependencies among dimensions and to illustrate how extracting knowledge can increase process quality. Our hypothesis is that if there is a correlation between the completeness, consistency and accuracy dimensions, which are considered independent variables, and the currency dimension, considered the dependent variable, then an improvement in data quality will result. Also, because of some difficulties with the currency dimension, a policy is required.
Fig. 1 illustrates the proposed framework for evaluating the effect of the independent data quality dimensions on the dependent dimension.
Thus, the aim of the proposed framework is to discover the dependency structure of the assessed data quality dimensions.
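The statistical side of such a framework can be sketched minimally: compute the correlation between per-record scores of each independent dimension and the currency scores treated as the dependent variable. The score series below are invented for illustration, and Pearson correlation is just one possible dependency measure:

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

# Invented per-record scores for four commonly referenced dimensions.
completeness = [0.90, 0.80, 0.95, 0.60, 0.70]
consistency  = [0.85, 0.75, 0.90, 0.65, 0.70]
accuracy     = [0.92, 0.81, 0.97, 0.55, 0.72]
currency     = [0.88, 0.79, 0.93, 0.60, 0.71]  # treated as the dependent variable

for name, scores in [("completeness", completeness),
                     ("consistency", consistency),
                     ("accuracy", accuracy)]:
    print(name, round(pearson(scores, currency), 3))
```

A strong correlation would support the hypothesis that the independent dimensions drive currency; a full framework would add the data mining step (e.g., rule discovery) on top of these scores.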
From the research perspective, many scholars have identified various methodologies and frameworks for assessing and improving data quality through different techniques and strategies applied to the data quality dimensions [2]. They have illustrated definitions for the dimensions and identified the most important data quality dimensions [2], [11], [12], [22]. The existing surveys identify forty data quality dimensions from 1985 to 2009. Since some dimensions, such as timeliness, currency, accuracy and completeness, are referenced more often than others, the results of this survey will be used to find correlations among data quality dimensions based on the proposed framework, combining data mining and statistical techniques to measure dependencies among them and to illustrate how process quality can be increased via the extracted knowledge. Specifically, our future work will evaluate the dependencies among the mentioned data quality dimensions for improving process quality.
[1] S. W. Tee, P.L. Bowen, P. Doyle, F.H. Rohde, "Factors influencing
organizations to improve data quality in their information systems,"
Accounting & Finance, vol. 47, pp. 335-355, 2007.
[2] C. Batini, C. Cappiello, C. Francalanci, A. Maurino, "Methodologies
for data quality assessment and improvement," ACM Computing
Surveys (CSUR), vol. 41, p. 16, 2009.
[3] W. Eckerson, "Data Warehousing Special Report: Data quality and
the bottom line," Applications Development Trends May, 2002.
[4] Y.Y.R. Wang, R.Y. Wang, M. Ziad, Y.W. Lee, Data quality vol. 23:
Springer, 2001.
[5] F. Casati, M.C. Shan, M. Sayal, "Investigating business processes,"
ed: Google Patents, 2009.
[6] M. Mecella, M. Scannapieco, A. Virgillito, R. Baldoni, T. Catarci, C.
Batini, "Managing data quality in cooperative information systems,"
On the Move to Meaningful Internet Systems 2002: CoopIS, DOA,
and ODBASE, pp. 486-502, 2002.
[7] C. Batini and M. Scannapieca, Data quality: Concepts, methodologies
and techniques: Springer-Verlag New York Inc, 2006.
[8] V. Peralta, "Data quality evaluation in data integration systems," Ph.D. thesis, Université de Versailles / Universidad de la República, Uruguay, 2008.
[9] F. G. Alizamini, M.M. Pedram, M. Alishahi, K. Badie, "Data quality
improvement using fuzzy association rules," 2010, pp. V1-468-V1-
[10] Y. Man, L. Wei, H. Gang, G. Juntao, "A noval data quality
controlling and assessing model based on rules," 2010, pp. 29-32.
[11] Y. Wand and R. Y. Wang, "Anchoring data quality dimensions in
ontological foundations," Communications of the ACM, vol. 39, pp.
86-95, 1996.
[12] KQ. Wang, SR. Tong, L. Roucoules, B. Eynard, "Analysis of data
quality and information quality problems in digital manufacturing,"
2008, pp. 439-443.
[13] M. Heravizadeh, J. Mendling, M. Rosemann, "Dimensions of
business processes quality (QoBP)," 2009, pp. 80-91.
[14] D. McGilvray, Executing data quality projects: Ten steps to quality
data and trusted information: Morgan Kaufmann, 2008.
[15] R. Y. Wang and D. M. Strong, "Beyond accuracy: What data quality
means to data consumers," Journal of management information
systems, vol. 12, pp. 5-33, 1996.
[16] M. Bovee, R.P. Srivastava, B. Mak, "A conceptual framework and belief-function approach to assessing overall information quality," International Journal of Intelligent Systems, vol. 18, pp. 51-74, 2003.
[17] T. C. Redman, Data quality for the information age: Artech House,
[18] M. Jarke, Fundamentals of data warehouses: Springer Verlag, 2003.
[19] D. P. Ballou and H. L. Pazer, "Modeling data and process quality in
multi-input, multi-output information systems," Management science,
pp. 150-162, 1985.
[20] F. Naumann, Quality-driven query answering for integrated
information systems vol. 2261: Springer Verlag, 2002.
[21] L. Liu and L. Chi, "Evolutionary data quality," 2002.
[22] L. L. Pipino, Y.W. Lee, R.Y. Wang, "Data quality assessment,"
Communications of the ACM, vol. 45, pp. 211-218, 2002.
[23] S. Knight and J. Burn, "Developing a framework for assessing
information quality on the world wide web," Informing Science:
International Journal of an Emerging Transdiscipline, vol. 8, pp. 159-
172, 2005.
[24] V. Peralta, "Data quality evaluation in data integration systems," Ph.D. thesis, Université de Versailles / Universidad de la República, Uruguay, 2008.
[25] D. Barone, et al., "Dependency discovery in data quality," 2010, pp.
... Component Data Objects Data are generated from raw data objects and stored momentarily until the final product is manufactured Information Products Data that is the concern of performing manufacturing action on data Adapted from Sidi et al. [12]. ...
... Sidi et al. [12] further classified data quality problems into two different groups: single-source and multi-source problems. Previous research also categorizes data quality into four different forms, as illustrated in Table 3. ...
... The way data is essential in some institu-Journal of Computer and Communications tions is the same way its quality must be checked. Sidi et al. [12] focus on a review of data quality dimensions to be employed for a proposed framework that combines statistical methods and data mining to evaluate dependencies present between dimensions; this study further illustrates how retrieving knowledge can increase process quality. ...
Full-text available
Several organizations have migrated to the cloud for better quality in business engagements and security. Data quality is crucial in present-day activities. Information is generated and collected from data representing real-time facts and activities. Poor data quality affects the organizational decision-making policy and customer satisfaction, and influences the organization’s scheme of execution negatively. Data quality also has a massive influence on the accuracy, complexity and efficiency of the machine and deep learning tasks. There are several methods and tools to evaluate data quality to ensure smooth incorporation in model development. The bulk of data quality tools permit the assessment of sources of data only at a certain point in time, and the arrangement and automation are consequently an obligation of the user. In ensuring automatic data quality, several steps are involved in gathering data from different sources and monitoring data quality, and any problems with the data quality must be adequately addressed. There was a gap in the literature as no attempts have been made previously to collate all the advances in different dimensions of automatic data quality. This limited narrative review of existing literature sought to address this gap by correlating different steps and advancements related to the automatic data quality systems. The six crucial data quality dimensions in organizations were discussed, and big data were compared and classified. This review highlights existing data quality models and strategies that can contribute to the development of automatic data quality systems.
... Although the definition of data quality varies in the literature, it is undisputed that data quality depends on many different factors and does not only concern accuracy. In [12], different data quality aspects and definitions from 1985 to 2009 were studied and 40 dimensions were identified, including timeliness, currency, accuracy and completeness, to name the most referenced. These are also reflected in [11], in which a hierarchical data quality framework was formulated from the perspective of data users. ...
... • Availability: This refers to timeliness, also mentioned in [12], and accessibility (also a part of the FAIR principles Findable, Accessible, Interoperable and Re-usable [13]). • Usability: This includes documentation, credibility and metadata. ...
Conference Paper
Full-text available
Data-driven systems and machine learning-based decisions are becoming increasingly important and are having an impact on our everyday lives. The prerequisite for this is good data quality, which must be ensured by preprocessing the data. However, a number of challenges arise in the process. These include the results of the process in terms of data quality, e.g., combating bias and ensuring fairness, and the preprocessing process itself. Here, human involvement and the lack of intelligent solutions and applications for domain experts without in-depth IT knowledge play a major role. This paper summarizes these challenges and provides an overview of the current state of the art. It proposes the design of a holistic tool, along with the necessary tasks to overcome these challenges and to support data preprocessing.
... Data issues are becoming more problematic, as around 60% of organizations face critical issues from bad DQ, and every individual organization may contain 10-30% of inaccurate data in their databases. As stated by [2] "DQ is generally described as the capability of data to satisfy stated and implied needs when used under specified conditions". Low-level DQ can cause inaccurate or missing data. ...
... These impurities of parent and children's nodes are calculated by using the Gini index. Consider the classes for samples from set ; then the Gini index values are calculated using (2). ...
Full-text available
Recently, the industry of healthcare started generating a large volume of datasets. If hospitals can employ the data, they could easily predict the outcomes and provide better treatments at early stages with low cost. Here, data analytics (DA) was used to make correct decisions through proper analysis and prediction. However, inappropriate data may lead to flawed analysis and thus yield unacceptable conclusions. Hence, transforming the improper data from the entire data set into useful data is essential. Machine learning (ML) technique was used to overcome the issues due to incomplete data. A new architecture, automatic missing value imputation (AMVI) was developed to predict missing values in the dataset, including data sampling and feature selection. Four prediction models (i.e., logistic regression, support vector machine (SVM), AdaBoost, and random forest algorithms) were selected from the well-known classification. The complete AMVI architecture performance was evaluated using a structured data set obtained from the UCI repository. Accuracy of around 90% was achieved. It was also confirmed from cross-validation that the trained ML model is suitable and not over-fitted. This trained model is developed based on the dataset, which is not dependent on a specific environment. It will train and obtain the outperformed model depending on the data available.
... Data quality is one of them, and it needs to be considered more in the context of big data. In [2], [3], the authors point out that to improve data quality, two strategies are (1) datadriven and (2) process-driven. A data-based approach deals with data, using methods and activities as a purge to improve its quality. ...
Conference Paper
Full-text available
In many health care domains, big data has arrived. How to manage and use big data better has become the focus of all walks of life. Many data sources provide the repeated fault data—the repeated fault data forming the delay of processing time and storage capacity. Big data includes properties like volume, velocity, variety, variability, value, complexity, and performance put forward more challenges. Most healthcare domains face the problem of testing for structured and unstructured data validation in big data. It provides low-quality data and delays in response. In testing process is delay and not provide the correct response. In Proposed, pre-testing and post-testing are used for big data testing. In pre-testing, classify fault data from different data sources. After Classification to group big data using SVM algorithms such as Text, Image, Audio, and Video file. In post-testing, to implement the pre-processing, remove the zero file size, unrelated file extension, and de-duplication after pre-processing to implement the Map-reduce algorithm to find out the big data efficiently. This process reduces the pre-processing time reduces the server energy, and increases the processing time. To remove the fault data before pre-processing means to increase the processing time and data storage
AI becomes a key enabler for Industry 4.0. Data / information quality become a real cornerstone on the overall process from user expectation to products / systems / solutions in a consistent perspective in order to ensure quality of the manufacturing production. This paper highlights some key characteristics in terms of information quality required to implement an effective AI based monitoring framework, in order to achieve operational excellence in Industry.
Full-text available
Batch processing reduces processing time in a business process at the expense of increasing waiting time. If this trade-off between processing and waiting time is not analyzed, batch processing can, over time, evolve into a source of waste in a business process. Therefore, it is valuable to analyze batch processing activities to identify waiting time wastes. Identifying and analyzing such wastes present the analyst with improvement opportunities that, if addressed, can improve the cycle time efficiency (CTE) of a business process. In this paper, we propose an approach that, given a process execution event log, (1) identifies batch processing activities, (2) analyzes their inefficiencies caused by different types of waiting times to provide analysts with information on how to improve batch processing activities. More specifically, we conceptualize different waiting times caused by batch processing patterns and identify improvement opportunities based on the impact of each waiting time type on the CTE. Finally, we demonstrate the applicability of our approach to a real-life event log.
Nowadays, businesses in many industries face an increasing flow of data and information. Data are at the core of the decision-making process, hence it is vital to ensure that the data are of high quality and no noise is present. Outlier detection methods aim to find unusual patterns in data and find their applications in many practical domains. These methods employ different techniques, ranging from pure statistical tools to deep learning models that have gained popularity in recent years. Moreover, machine learning models are among the most popular outlier detection techniques. They have several characteristics which affect the potential of their usefulness in real-life scenarios. The goal of this paper is to add to the existing body of research on outlier detection by comparing the isolation forest, DBSCAN and LOF techniques. Thus, we investigate the research question: which of these outlier detection models performs best in practical business applications. To this end, three models are built on 12 datasets and compared using 5 performance metrics. The final comparison of the models is based on the McNemar’s test, as well as on ranks per performance measure and on average. Three main conclusions can be made from the benchmarking study. First, the models considered in this research disagree differently, i.e. their type I and type II errors are not similar. Second, considering the time, AUPRC and sensitivity metrics, the iForest model is ranked the highest. Hence, the iForest model is the best in cases where time performance is a key consideration, as well as when the opportunity costs of not detecting an outlier are high. Third, the DBSCAN model obtains the highest ranking along the F1 score and precision dimensions. That allows us to conclude that if raising many false alarms is not an important concern, the DBSCAN model is the best to employ.
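Of the three compared techniques, DBSCAN's treatment of outliers as noise points is easy to show in miniature. Below is a pure-Python sketch with made-up 2-D data; a real benchmark would use library implementations such as scikit-learn's:

```python
import math

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: returns a label per point; -1 marks noise (outliers)."""
    def neighbors(i):
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1           # tentatively noise
            continue
        labels[i] = cluster          # i is a core point: start a cluster
        seeds = list(nbrs)
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:
                seeds.extend(j_nbrs)  # j is also core: keep expanding
        cluster += 1
    return labels

data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print(dbscan(data, eps=1.5, min_pts=3))  # [0, 0, 0, 0, -1]
```

The four clustered points form one dense region, while the isolated point at (10, 10) is labeled -1, i.e. flagged as an outlier.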
This paper presents the “CDAC AI life cycle,” a comprehensive life cycle for the design, development, and deployment of artificial intelligence (AI) systems and solutions. It addresses the void of a practical and inclusive approach that spans beyond the technical constructs to also focus on the challenges of risk analysis of AI adoption, transferability of prebuilt models, increasing importance of ethics and governance, and the composition, skills, and knowledge of an AI team required for successful completion. The life cycle is presented as the progression of an AI solution through its distinct phases—design, develop, and deploy—and 19 constituent stages from conception to production as applicable to any AI initiative. This life cycle addresses several critical gaps in the literature where related work on approaches and methodologies are adapted and not designed specifically for AI. A technical and organizational taxonomy that synthesizes the functional value of AI is a further contribution of this article.
Class label noise is a critical component of data quality that directly inhibits the predictive performance of machine learning algorithms. While many data-level and algorithm-level methods exist for treating label noise, the challenges associated with big data call for new and improved methods. This survey addresses these concerns by providing an extensive literature review on treating label noise within big data. We begin with an introduction to the class label noise problem and traditional methods for treating label noise. Next, we present 30 methods for treating class label noise in a range of big data contexts, i.e. high volume, high variety, and high velocity problems. The surveyed works include distributed solutions capable of operating on data sets of arbitrary sizes, deep learning techniques for large-scale data sets with limited clean labels, and streaming techniques for detecting class noise in the presence of concept drift. Common trends and best practices are identified in each of these areas, implementation details are reviewed, empirical results are compared across studies when applicable, and references to 17 open-source projects and programming packages are provided. An emphasis on label noise challenges, solutions, and empirical results as they relate to big data distinguishes this work as a unique contribution that will inspire future research and guide machine learning practitioners.
Mobile sensors are increasingly used to monitor air quality to accurately quantify human exposure to air pollution. These sensors are subject to various issues (misuse, malfunctions, battery problems, etc.) that are likely to cause data quality problems. These quality problems may have a considerable impact on the reliability of analytical studies. In this work, we address the problem of data quality evaluation and improvement in mobile crowd-sensing environments. Our work is focused on the data completeness quality dimension. We introduce a multi-dimensional model to represent the data coming from the sensors in this context, and then present the different facets of data completeness inspired by the model. We propose quality indicators capturing different facets of completeness along with their corresponding quality metrics. We also propose an approach to improve data completeness by extending two existing data imputation techniques, SVDImpute and KNNImpute, with information about the sensor quality. Our experiments show that our quality-aware imputation approach improves the accuracy of the imputation achieved by the original techniques.
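KNN-based imputation of the kind KNNImpute performs can be sketched as follows. This is a simplified illustration with made-up readings; the paper's quality-aware extension additionally weights neighbours by sensor quality, which is omitted here:

```python
import math

def knn_impute(rows, target, k=2):
    """Impute missing values (None) in `target` as the mean of the
    corresponding values in the k nearest complete rows."""
    observed = [i for i, v in enumerate(target) if v is not None]

    def distance(row):
        # Compare only on the dimensions observed in the target row.
        return math.sqrt(sum((row[i] - target[i]) ** 2 for i in observed))

    nearest = sorted(rows, key=distance)[:k]
    return [v if v is not None else sum(r[i] for r in nearest) / k
            for i, v in enumerate(target)]

# Complete sensor readings; the third reading of the target row is missing:
complete = [[1.0, 2.0, 3.0], [1.2, 2.1, 2.9], [9.0, 8.0, 7.0]]
print(knn_impute(complete, [1.1, 2.0, None], k=2))
```

The two nearest rows are the first two, so the missing value is imputed as their mean, 2.95; the distant row contributes nothing.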
This book is an introduction and source book for practitioners, graduate students, and researchers interested in the state of the art and the state of the practice in data warehousing. It resulted from our observation that, while there are a few hands-on practitioner books on data warehousing, the research literature tends to be fragmented and poorly linked to the commercial state of practice. As a result of the synergistic view taken in the book, the last chapter presents a new approach for data warehouse quality assessment and quality-driven design which reduces some of the recognized shortcomings. To follow this book, it will be useful for the reader to be familiar with the basics of the relational model of databases. The book is made up of seven chapters. Chapter 1 sets the stage by giving a broad overview of some important terminology and vendor strategies. Chapter 2 summarizes the research efforts in data warehousing and gives a short description of the framework for data warehouses used in this book. The next two chapters address the main data integration issues encountered in data warehousing: Chapter 3 presents a survey of the main techniques used when linking information sources to a data warehouse, emphasizing the need for semantic modeling of the relationships. Chapter 4 investigates the propagation of updates from operational sources through the data warehouse to the client analyst, looking both at incremental update computations and at the many facets of refreshment policies. The next two chapters study the client side of a data warehouse: Chapter 5 shows how to reorganize relational data into the multidimensional data models used for OLAP applications, focusing on the conceptualization of, and reasoning about, multiple hierarchically organized dimensions. Chapter 6 takes a look at query processing and its optimization, taking into account the reuse of materialized views and the multidimensional storage of data.
In the literature, there is not much coherence between all these technical issues on the one side, and the business reasons and design strategies underlying data warehousing projects. Chapter 7 ties these aspects together. It presents an extended architecture for data warehousing and links it to explicit models of data warehouse quality. It is shown how this extended approach can be used to document the quality of a data warehouse project, and to design a data warehouse solution for specific quality criteria.
Data and information obtained from data analysis are an essential asset for constructing and supporting information systems. As data is a significant resource, its quality is critical to the effectiveness of business processes. Relationships among the four major data quality dimensions are often neglected in process improvement. For this reason, this study proposes a reliable framework to support process activities in information systems, focusing on four critical quality dimensions: accuracy, completeness, consistency, and timeliness. A qualitative approach was conducted using a questionnaire, and the responses were assessed to measure the reliability and validity of the survey. Factor analysis and the Cronbach-alpha test were applied to interpret the results. The results show that the items of each data quality dimension and improvement process are reliable and valid. This framework can be used to evaluate data quality in an information system to improve the involved process.
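The Cronbach-alpha reliability test mentioned above has a simple closed form: alpha = k/(k-1) * (1 - sum of item variances / variance of total scores), where k is the number of questionnaire items. A minimal sketch with illustrative survey data:

```python
def variance(xs):
    """Sample variance (n - 1 in the denominator)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def cronbach_alpha(items):
    """items: one list of respondent scores per questionnaire item."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]      # total score per respondent
    item_var = sum(variance(scores) for scores in items)  # sum of item variances
    return k / (k - 1) * (1 - item_var / variance(totals))

# Three items answered by four respondents (illustrative data):
item_scores = [[4, 5, 3, 4], [4, 4, 3, 5], [5, 5, 2, 4]]
print(round(cronbach_alpha(item_scores), 3))  # 0.818
```

A value of 0.818 would typically be read as acceptable internal consistency, since common rules of thumb regard alpha above roughly 0.7 as reliable.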
This work focuses on the increasing importance of data quality in organizations, especially in digital manufacturing companies. The paper first reviews related work in the field of data quality, including the definition, dimensions, measurement and assessment, and improvement of data quality. Then, taking digital manufacturing as the research object, it analyzes the different information roles, information manufacturing processes, influential factors of information quality, and the transformation levels and paths of data/information quality in digital manufacturing companies. Finally, an approach for the diagnosis, control, and improvement of data/information quality in digital manufacturing companies is proposed as the basis for further work.
Poor data quality can seriously hinder or damage the efficiency and effectiveness of organizations and businesses. The growing awareness of such repercussions has led to major public initiatives like the "Data Quality Act" in the USA and the "European 2003/98" directive of the European Parliament. Batini and Scannapieco present a comprehensive and systematic introduction to the wide set of issues related to data quality. They start with a detailed description of different data quality dimensions, like accuracy, completeness, and consistency, and their importance in different types of data, like federated data, web data, or time-dependent data, and in different data categories classified according to frequency of change, like stable, long-term, and frequently changing data. The book's extensive description of techniques and methodologies from core data quality research as well as from related fields like data mining, probability theory, statistical data analysis, and machine learning gives an excellent overview of the current state of the art. The presentation is completed by a short description and critical comparison of tools and practical methodologies, which will help readers to resolve their own quality problems. This book is an ideal combination of the soundness of theoretical foundations and the applicability of practical approaches. It is ideally suited for everyone – researchers, students, or professionals – interested in a comprehensive overview of data quality issues. In addition, it will serve as the basis for an introductory course or for self-study on this topic.
Data warehouses have captured the attention of practitioners and researchers alike. But the design and optimization of data warehouses remains an art rather than a science. This book presents the first comparative review of the state of the art and best current practice of data warehouses. It covers source and data integration, multidimensional aggregation, query optimization, update propagation, metadata management, quality assessment, and design optimization. Furthermore, it presents a conceptual framework by which the architecture and quality of data warehouse efforts can be assessed and improved using enriched metadata management combined with advanced techniques from databases, business modeling, and artificial intelligence. For researchers and database professionals in academia and industry, the book offers an excellent introduction to the issues of quality and metadata usage in the context of data warehouses. In the second edition, the significant advances of the state of the practice in the last three years are discussed and the conceptual framework is extended to a full methodology for data warehouse design, illustrated by several industrial case studies.
Information is essential to effective competition. Recent studies show that data quality problems are costing businesses billions of dollars each year, in unnecessary printing, postage, and staffing costs, in the steady erosion of an organization's credibility among customers and suppliers, and in an organizational inability to make sound decisions. In Executing Data Quality Projects: Ten Steps to Quality Data and Trusted Information, Danette McGilvray presents a systematic, proven approach to improving and creating data and information quality within the enterprise. She describes a methodology that combines a conceptual framework for understanding information quality with the tools, techniques, and instructions for improving and creating information quality. Her trademarked "Ten Steps" approach applies to all types of data and to all types of organizations. The book includes numerous templates, detailed examples, and practical advice for executing every step of the Ten Steps approach; an easy-to-use format highlighting key concepts and definitions, important checkpoints, communication activities, and best practices; and a companion Web site with links to numerous data quality resources, including many of the planning and information-gathering templates featured in the text, quick summaries of key ideas from the Ten Steps methodology, and other tools and information available online.
Contents: Querying the Web; Integrating Autonomous Information Sources; Information Quality; Information Quality Criteria; Quality Ranking Methods; Quality-Driven Query Answering; Quality-Driven Query Planning; Query Planning Revisited; Completeness of Data; Completeness-Driven Query Optimization; Discussion; Conclusion.
As a resource, data is the base for information construction and application. Following the principle of "garbage in, garbage out", we must ensure that data is reliable, free of errors, and accurately reflects the real situation, so that it can support the right decisions. In practice, however, various causes leave existing business systems with poor-quality, dirty data, and dirty data is an important factor that undermines right decisions. To address this, this paper creates a metadata-based data quality rule base to improve the traditional quality control model, proposes a more practically applicable weighted assessment algorithm, and constructs a three-tier data quality assessment system model, based on a study of the definition and classification of quality, assessment algorithms, metadata, and control theory. The model has been confirmed to achieve comprehensive data quality management and control in practical oilfield applications.
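The abstract does not specify the weighted assessment algorithm, so as an illustration only, a weighted aggregation of per-dimension quality scores might look like the sketch below; the dimension names, scores, and weights are assumptions, and in the paper's model the metadata-driven rule base would supply them:

```python
def weighted_quality_score(dimension_scores, weights):
    """Aggregate per-dimension quality scores (0..1) into one weighted score."""
    total_weight = sum(weights.values())
    return sum(dimension_scores[d] * w for d, w in weights.items()) / total_weight

# Hypothetical per-dimension scores and business-defined weights:
scores = {"accuracy": 0.9, "completeness": 0.8, "consistency": 0.95, "timeliness": 0.7}
weights = {"accuracy": 0.4, "completeness": 0.3, "consistency": 0.2, "timeliness": 0.1}
print(weighted_quality_score(scores, weights))
```

Normalizing by the total weight means the weights need not sum to one, which is convenient when different business units assign weights independently.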
This paper presents a general model to assess the impact of data and process quality upon the outputs of multi-user information-decision systems. The data flow/data processing quality control model is designed to address several dimensions of data quality at the collection, input, processing and output stages. Starting from a data flow diagram of the type used in structured analysis, the model yields a representation of possible errors in multiple intermediate and final outputs in terms of input and process error functions. The model generates expressions for the possible magnitudes of errors in selected outputs. This is accomplished using a recursive-type algorithm which traces systematically the propagation and alteration of various errors. These error expressions can be used to analyze the impact that alternative quality control procedures would have on the selected outputs. The paper concludes with a discussion of the tractability of the model for various types of information systems as well as an application to a representative scenario.
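The recursive tracing of error propagation through a data flow can be illustrated with a toy additive model. This is a hypothetical sketch, not the paper's error functions; the node names and error magnitudes are made up:

```python
def propagate_errors(graph, input_errors, process_errors):
    """Trace error magnitudes through a data-flow graph (DAG).
    Each node's output error is the sum of its inputs' errors plus its
    own process error; a simple additive model for illustration only."""
    errors = dict(input_errors)  # known errors at the collection/input stage

    def error_of(node):
        if node not in errors:
            upstream = sum(error_of(p) for p in graph.get(node, []))
            errors[node] = upstream + process_errors.get(node, 0.0)
        return errors[node]

    return error_of

# Collection feeds validation, which feeds the report (hypothetical flow):
flow = {"validate": ["collect"], "report": ["validate"]}
err = propagate_errors(flow, {"collect": 0.02}, {"validate": 0.01, "report": 0.005})
print(err("report"))
```

The recursion mirrors the paper's idea: the error expression for a selected output is built by systematically tracing back through intermediate stages, so one can ask how tightening any single stage's quality control would change the output error.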