A FORMAL DEFINITION OF DATA QUALITY PROBLEMS
(Completed paper)
Paulo Oliveira
DI/gEPL – Languages Specification and Processing Group
University of Minho (Portugal), and
GECAD/ISEP-IPP – Knowledge Engineering and Decision Support Group
Institute of Engineering – Polytechnic of Porto (Portugal)
pjo@isep.ipp.pt
Fátima Rodrigues
GECAD/ISEP-IPP - Knowledge Engineering and Decision Support Group
Institute of Engineering – Polytechnic of Porto (Portugal)
mfc@isep.ipp.pt
Pedro Henriques
DI/gEPL – Languages Specification and Processing Group
University of Minho (Portugal)
prh@di.uminho.pt
Abstract: The exploration of data to extract information or knowledge to support decision making
is a critical success factor for an organization in today's society. However, several problems can
affect data quality. These problems have a negative effect on the results extracted from data,
affecting their usefulness and correctness. In this context, it is quite important to know and
understand the data problems. This paper presents a taxonomy of data quality problems, organizing
them by granularity levels of occurrence. A formal definition is presented for each problem
included. The taxonomy provides rigorous definitions, which are richer in information than the
textual definitions used in previous works. These definitions are useful for the development of a
data quality tool that automatically detects the identified problems.
Key Words: Data Quality Problems, Formal Definition, Taxonomy
1. INTRODUCTION
Nowadays, public and private organizations understand the value of data. Data is a key asset to improve
efficiency in today’s dynamic and competitive business environment. However, as organizations begin to
create integrated data warehouses for decision support, the resulting Data Quality (DQ) problems become
painfully clear [12]. A study by the Meta Group revealed that 41% of the data warehouse projects fail,
mainly due to insufficient DQ, leading to wrong decisions [8]. The quality of the input data strongly
influences the quality of the results [15] (“garbage in, garbage out” principle).
The concept of DQ is vast, comprising different definitions and interpretations. DQ is essentially studied
in two research communities: databases and management. The first one studies DQ from a technical point
of view (e.g., [4]), while the second one is also concerned with other aspects or dimensions (e.g.,
accessibility, believability, relevancy, interpretability, objectivity) involved in DQ (e.g., [13, 17]). In the
context of this paper we follow the databases perspective, i.e., DQ means just the quality of the data
values or instances.
DQ problems are also labeled as errors, anomalies, or even dirtiness, and include, among others, missing
attribute values, incorrect attribute values, or different representations of the same data. It is not
uncommon for operational databases to have 60% to 90% of bad data [3]. These problems are an obstacle
to effective data usage and, as already said, negatively affect the results and conclusions obtained.
Therefore, before using an analysis-oriented tool, data must be examined to check whether the required
quality is assured. If not, DQ must be improved by removing or repairing any problems that may exist
[10].
Concerns about DQ problems arise in three different contexts [4]: (i) when one wants to correct anomalies in a
single data source, as files and databases (e.g., duplicate elimination in a file); (ii) when poorly structured
or unstructured data is migrated into structured data; or (iii) when one wants to integrate data coming
from multiple sources into a single new data source (e.g., data warehouse construction). In the last
situation, the problems are even more critical because distinct data sources frequently contain redundant
data under different representations. The representations must be consolidated and the duplicates removed
to provide an accurate and consistent access to data.
The paper reports the new developments of our work, initially presented in [11]. The main contributions
provided here are: (i) a formal definition for each DQ problem and (ii) a taxonomy of DQ problems that
organizes them by granularity levels of occurrence.
In [7, 9, 14] a comprehensive list of problems that affect DQ is presented. However, the problems are
only described through a textual definition. It is commonly accepted that natural language tends to be
ambiguous by nature. Therefore, doubts about the true meaning of some DQ problems arise, which means
that they need to be further clarified. Using a formal definition is a suitable approach to specify each DQ
problem in a rigorous way.
Besides being rigorous, this kind of definition is also useful because it carries more information than a
textual definition. The definition makes explicit: (i) that it concerns just a given data type (e.g., the
string data type); (ii) the metadata knowledge needed to detect a problem (e.g., the attribute domain);
(iii) the mathematical expression that specifies the DQ problem, which can be computationally translated
to automate its detection (just for illustration purposes, we show the definition of domain violation
presented later: ∃t ∈ r : v(t,a) ∉ Dom(a)); and (iv) possibly a required function that allows the DQ
problem to be detected (e.g., to detect a misspelling error, a spell checker function must be
available). A framework for
DQ problems is our first step towards the development of an automated tool for detecting the problems.
With this tool we intend to complement the capabilities of today’s data profiling tools.
We argue that a taxonomy of DQ problems is important because: (i) it is useful to understand how far a
given DQ tool is able to go in detecting and correcting DQ problems, i.e., it allows the coverage of a
DQ tool to be measured; and (ii) it guides research efforts, emphasizing the DQ problems that deserve
further attention, i.e., if a DQ problem has no detection or correction support, research attention
should be given to it.
The paper is organized as follows. Section 2 presents in detail our taxonomy, organized by granularity
levels of DQ problems. Section 3 compares our taxonomy to related work. Finally, in Section 4,
conclusions and some future work directions are described.
2. DATA QUALITY PROBLEMS
Figure 1 presents the well-known model of data organization: (i) data is stored in multiple data
sources; (ii) a data source is composed of several relations, and relationships are established among
them; (iii) a single relation is made up of several tuples; and (iv) a tuple is composed of a predefined
number of attributes. This model results in a hierarchy of four levels of data granularity: multiple
data sources; multiple relations; single relation; and attribute/tuple.
We have identified the DQ problems and created the taxonomy based on this model. Using real-world
data from the retail sector, we thoroughly analyzed each granularity level, from the lowest (attribute/
tuple) to the highest (multiple data sources), to detect specific DQ problems. The purpose was to tackle
first the most specific and easier to detect DQ problems and leave to the end the most generic and difficult
ones. The analysis was based on the fundamental elements of this model of data organization (e.g., data
types, relationships, data representation structure). The systematic approach used supports our conviction
that the taxonomy is complete.
The DQ problems identified are presented through a rigorous definition and properly illustrated with an
example. Our taxonomy covers the problems that affect data represented in a tabular format, i.e., at least
in the first normal form. The types of data considered are: numeric, date/time, enumerated, and string.
Multimedia data was excluded, since it requires a special kind of manipulation.
2.1 Preliminaries
We start by introducing the notation used throughout the paper, following [1]. A relation schema consists
of a name R (the relation name), along with a list A = a1,a2,...,an of distinct attribute names, represented by
R(a1,a2,...,an) or simply by R(A). The integer n represents the relation schema degree. A data source DS is
composed of a set of m relation schemas R1(A1), R2(A2),...,Rm(Am). A domain d is a set of atomic values.
Given a set D of domains and a set A of attribute names, we assume there is a function Dom: A → D that
associates attributes with their domains. The function Dom is applied to the set of attribute names, which
is represented by Dom(A) = Dom(a1,a2,...,an) = Dom(a1) × Dom(a2) × ... × Dom(an).

Figure 1: Typical model of data organization

A relation instance (or relation, for short) is a finite set r ⊆ Dom(a1) × Dom(a2) × ... × Dom(an) and is
represented by r(a1,a2,...,an) or simply by r(A). Each element of r is called a tuple and is represented by t.
A tuple t can be viewed as a function that associates a value of Dom(a1) with a1, a value of Dom(a2) with a2,
..., and a value of Dom(an) with an. The value of attribute a in tuple t is represented by v(t,a). The values
in tuple t of attributes a1,a2,...,an, i.e., v(t,a1),v(t,a2),...,v(t,an), are denoted by v(t,A). By analogy,
the values attribute a takes for all tuples are represented by v(T,a). The application of a function f over a
value v is represented by f(v), while over a set of values V it is represented by f(V). If, for some reason, a
value transformation cannot be done, the function f acts as the identity function, i.e., f(v) = v or f(V) = V.
The data type of attribute a is denoted by type(a).
2.2 DQ Problems at the Level of Attribute/Tuple
This level is divided into three groups of DQ problems that were encountered by analyzing the value(s)
of: (i) a single attribute of a single tuple; (ii) a single attribute in multiple tuples (a column); and (iii)
multiple attributes of a single tuple (a row).
2.2.1 Single Attribute of a Single Tuple
The following DQ problems were detected by analyzing the values of single attributes (with different data
types) in single tuples (as presented in Figure 2).
Missing value
Definition: Let S be a set of attribute names, defined as: S = {a | a ∈ R(A) ∧ a is a mandatory
attribute}, i.e., S ⊆ R(A). There is a missing value in attribute a ∈ S if and only if (iff): ∃t ∈ r :
v(t,a) = null.
Example: Absence of value in the mandatory attribute Name of a customer.
Note: We do not consider the absence of value in an optional attribute to be a DQ problem.
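As a minimal sketch of how this definition can be checked mechanically, the following Python function scans a relation for nulls in mandatory attributes. The list-of-dicts representation (one dict per tuple, None playing the role of null) and the attribute names are illustrative assumptions, not from the paper:

```python
def missing_values(relation, mandatory):
    """Return (tuple index, attribute) pairs where a mandatory attribute is null.

    Optional attributes are ignored, matching the paper's note that an empty
    optional attribute is not a DQ problem.
    """
    return [(i, a) for i, t in enumerate(relation)
            for a in mandatory if t.get(a) is None]

customers = [
    {"Id": 1, "Name": "John Taylor", "Phone": None},   # Phone is optional here
    {"Id": 2, "Name": None, "Phone": "555-0101"},      # Name is mandatory
]
print(missing_values(customers, mandatory={"Id", "Name"}))  # [(1, 'Name')]
```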
Syntax violation
Definition: Let G(a) be the syntax of attribute a, given by a grammar or a regular expression. Let
L(G(a)) be the language generated by the grammar or regular expression. There is a syntax violation
in attribute a ∈ R(A) iff: ∃t ∈ r : v(t,a) ∉ L(G(a)).
Example: The attribute Order_Date contains the value 13/12/2004, instead of 2004/12/13.
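When G(a) is given as a regular expression, membership in L(G(a)) is a full-match test. A sketch under the same list-of-dicts assumption, with an illustrative yyyy/mm/dd pattern:

```python
import re

def syntax_violations(relation, attribute, pattern):
    """Return indices of tuples whose value of `attribute` is not in L(G(a)),
    where G(a) is expressed as the regular expression `pattern`."""
    syntax = re.compile(pattern)
    return [i for i, t in enumerate(relation)
            if t[attribute] is not None and not syntax.fullmatch(t[attribute])]

orders = [{"Order_Date": "2004/12/13"}, {"Order_Date": "13/12/2004"}]
print(syntax_violations(orders, "Order_Date", r"\d{4}/\d{2}/\d{2}"))  # [1]
```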
Incorrect value
Definition: Let u(t,a) be the correct and updated value that the attribute a of tuple t was supposed
to have. There is an incorrect value in attribute a ∈ R(A) iff: ∃t ∈ r : v(t,a) ∈ Dom(a) ∧ v(t,a) ≠
u(t,a).
Figure 2: A single attribute of a single tuple
Example: The attribute Creation_Date contains the value 23/09/2003, instead of 23/09/2004.
Domain violation
Definition: There is a domain violation in attribute a ∈ R(A) iff: ∃t ∈ r : v(t,a) ∉ Dom(a).
Example: In a given order, the attribute Ordered_Quantity contains a negative value.
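Encoding Dom(a) as a membership predicate makes the detection expression directly executable. A sketch with an assumed non-negativity domain for the quantity example:

```python
def domain_violations(relation, attribute, in_domain):
    """Return indices of tuples where v(t,a) ∉ Dom(a); `in_domain` is a
    predicate encoding membership in the attribute domain."""
    return [i for i, t in enumerate(relation) if not in_domain(t[attribute])]

order_lines = [{"Ordered_Quantity": 3}, {"Ordered_Quantity": -2}]
print(domain_violations(order_lines, "Ordered_Quantity", lambda q: q >= 0))  # [1]
```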
Violation of business domain constraint
Definition: Let check be a function that receives an attribute value, checks whether it respects a
given constraint, and returns a boolean value. There is a violation of business domain constraint in
attribute a ∈ R(A) iff: ∃t ∈ r : check(v(t,a)) = false.
Example: The attribute Name of a customer must have at least two words; however, in a certain
tuple this constraint is not respected.
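The check function is deliberately left abstract by the definition; a sketch instantiating it with the two-word constraint from the example (the relation layout is, as before, an assumption):

```python
def constraint_violations(relation, attribute, check):
    """Return indices of tuples where check(v(t,a)) is false."""
    return [i for i, t in enumerate(relation) if not check(t[attribute])]

# Constraint from the example: a customer name must have at least two words.
customers = [{"Name": "John Taylor"}, {"Name": "Madonna"}]
print(constraint_violations(customers, "Name", lambda v: len(v.split()) >= 2))  # [1]
```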
A domain violation in an attribute whose data type is string may be further detailed as presented next.
Let S be a set of attribute names, defined as: S = {a | a ∈ R(A) ∧ type(a) = string}, i.e., S ⊆ R(A). This set
is used in the following definitions.
Invalid substring
Definition: Let v'(t,a) be a substring of v(t,a). There is an invalid substring in attribute a ∈ S iff: ∃t
∈ r : v(t,a) ∉ Dom(a) ∧ v'(t,a) ∈ Dom(a).
Example: The attribute Customer_Name also stores the academic degree (e.g., Dr. John Taylor).
Misspelling error
Definition: Let spell be a spelling checker function that receives a misspelled word, looks up the
correct word in a language dictionary, and returns it. There is a misspelling error in attribute a ∈ S
iff: ∃t ∈ r : v(t,a) ∉ Dom(a) ∧ spell(v(t,a)) ∈ Dom(a).
Example: The attribute Address_Place contains the value Sant Louis, instead of Saint Louis.
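The definition only requires that some spell function exists; as a stand-in, the standard library's difflib.get_close_matches can play that role against the attribute domain. A sketch under that assumption:

```python
import difflib

def misspelling_errors(relation, attribute, domain):
    """Flag values outside Dom(a) whose closest domain entry suggests a
    misspelling; get_close_matches stands in for the paper's spell()."""
    errors = []
    for i, t in enumerate(relation):
        value = t[attribute]
        if value not in domain:
            suggestion = difflib.get_close_matches(value, domain, n=1)
            if suggestion:  # spell(v(t,a)) is in Dom(a)
                errors.append((i, value, suggestion[0]))
    return errors

places = [{"Address_Place": "Saint Louis"}, {"Address_Place": "Sant Louis"}]
print(misspelling_errors(places, "Address_Place", {"Saint Louis", "Springfield"}))
# [(1, 'Sant Louis', 'Saint Louis')]
```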
Imprecise value
Definition: Let translate be a function that receives an abbreviation or an acronym, looks up its
meaning (in full words) in a dictionary (lookup table), and returns it. There is an imprecise value in
attribute a ∈ S iff: ∃t ∈ r : v(t,a) ∉ Dom(a) ∧ translate(v(t,a)) ∈ Dom(a).
Example: The value Ant. in attribute Customer_Contact may represent Anthony, Antonia, etc.
2.2.2 Single Attribute in Multiple Tuples
The following DQ problems were identified by analyzing the values of a single attribute in multiple
tuples, as illustrated in Figure 3.
Unique value violation
Definition: Let S be a set of attribute names, defined as: S = {a | a ∈ R(A) ∧ a is a unique value
attribute}, i.e., S ⊆ R(A). There is a unique value violation in attribute a ∈ S iff: ∃t1, t2 ∈ r : v(t1,a)
= v(t2,a) ∧ t1 ≠ t2.
Example: Two different customers have the same taxpayer identification number.
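Detection amounts to counting occurrences of each value in the column. A sketch with an illustrative taxpayer-number attribute:

```python
from collections import Counter

def unique_value_violations(relation, attribute):
    """Return the values of a supposedly unique attribute that occur in more
    than one tuple (t1 ≠ t2 with v(t1,a) = v(t2,a))."""
    counts = Counter(t[attribute] for t in relation)
    return sorted(v for v, n in counts.items() if n > 1)

customers = [
    {"Id": 1, "Taxpayer_Id": 502899106},
    {"Id": 2, "Taxpayer_Id": 502899106},
    {"Id": 3, "Taxpayer_Id": 501000000},
]
print(unique_value_violations(customers, "Taxpayer_Id"))  # [502899106]
```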
Existence of synonyms
Definition: Let S be a set of attribute names, defined as: S = {a | a ∈ R(A) ∧ type(a) = string}, i.e.,
S ⊆ R(A). Let meaning be a function that receives a word, looks up its meaning in a dictionary,
and returns it. There are synonyms in attribute a ∈ S iff: ∃t1, t2 ∈ r : v(t1,a) ≠ v(t2,a) ∧
meaning(v(t1,a)) = meaning(v(t2,a)).
Example: The attribute Occupation contains the values Professor and Teacher in different tuples,
which in fact represent the same occupation.
Violation of business domain constraint
Definition: Let check be a function that receives the set of all values of an attribute, checks whether
a given constraint is respected, and returns a boolean value. There is a violation of business domain
constraint in attribute a ∈ R(A) iff: check(v(T,a)) = false.
Note: As defined in Section 2.1, v(T,a) represents the values that attribute a takes for all tuples.
Example: The values of attribute Invoice_Date must appear in the relation in ascending order, but
this does not happen.
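Here check receives the whole column v(T,a) at once. A sketch instantiating it with the ascending-order constraint from the example (yyyy/mm/dd strings sort chronologically, an assumption of this illustration):

```python
def column_constraint_violated(relation, attribute, check):
    """Apply check() to the whole column v(T,a); True means a violation."""
    return not check([t[attribute] for t in relation])

def ascending(values):
    """Constraint from the example: values must appear in ascending order."""
    return all(x <= y for x, y in zip(values, values[1:]))

invoices = [{"Invoice_Date": "2004/01/10"},
            {"Invoice_Date": "2004/03/02"},
            {"Invoice_Date": "2004/02/15"}]
print(column_constraint_violated(invoices, "Invoice_Date", ascending))  # True
```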
2.2.3 Multiple Attributes of a Single Tuple
The following DQ problems were identified by analyzing the values of multiple attributes of a single
tuple, as illustrated in Figure 4.
Semi-empty tuple
Definition: Let θ be a user-defined threshold (a real number between 0 and 1), and S the set of
attribute names that are empty in tuple t, defined as: S = {a | a ∈ R(A) ∧ v(t,a) = null}, i.e., S ⊆
R(A). Let m be the cardinality of set S, defined as: m = |S|, and n be the relation schema degree. The
tuple t is a semi-empty tuple iff: m/n ≥ θ.
Figure 3: A single attribute in multiple tuples
Example: If 60% or more of the tuple attributes are empty, then the tuple is classified as semi-empty.
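The m/n ≥ θ test translates directly into code. A sketch with θ = 0.6 reproducing the 60% example (attribute names are illustrative):

```python
def semi_empty_tuples(relation, attributes, theta=0.6):
    """Return indices of tuples where the fraction of null attributes
    m/n reaches the threshold θ."""
    n = len(attributes)
    flagged = []
    for i, t in enumerate(relation):
        m = sum(1 for a in attributes if t.get(a) is None)
        if m / n >= theta:
            flagged.append(i)
    return flagged

attrs = ["Name", "Address", "Phone", "Email", "Fax"]
tuples = [
    {"Name": "John", "Address": "Flowers St.", "Phone": None, "Email": None, "Fax": None},
    {"Name": "Ann", "Address": None, "Phone": None, "Email": None, "Fax": None},
]
print(semi_empty_tuples(tuples, attrs))  # [0, 1] -- 3/5 and 4/5 empty
```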
Violation of functional dependency
Definition: Let a2 be an attribute whose value functionally depends on the values of other
attributes. The set of these attribute names is defined as: S = {a1 | a1, a2 ∈ R(A) : the value of a2
functionally depends on the value of a1}, i.e., S ⊆ R(A). Let value be a function that receives a set
of values of a tuple, computes the value of the functionally dependent attribute, and returns it. There
is a violation of functional dependency in tuple t iff: ∃t ∈ r : value(v(t,S)) ≠ v(t,a2).
Example: There is a functional dependency between Zip_Code and City: each value of the first
attribute must be associated with exactly one value of the second. Therefore, the following values
of two customer tuples violate the functional dependency: (Zip_Code = 4000; City = "Porto") and
(Zip_Code = 4000; City = "Lisboa").
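When no value function is available, a common detection strategy is to look for determinant values mapped to more than one dependent value. A sketch of that approach using the Zip_Code → City example:

```python
def fd_violations(relation, determinant, dependent):
    """Return determinant values that map to more than one dependent value,
    i.e., witnesses of a functional-dependency violation."""
    seen = {}
    violating = set()
    for t in relation:
        key, value = t[determinant], t[dependent]
        if seen.setdefault(key, value) != value:
            violating.add(key)
    return sorted(violating)

customers = [{"Zip_Code": 4000, "City": "Porto"},
             {"Zip_Code": 4000, "City": "Lisboa"},
             {"Zip_Code": 1000, "City": "Lisboa"}]
print(fd_violations(customers, "Zip_Code", "City"))  # [4000]
```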
Violation of business domain constraint
Definition: Let check be a function that receives the set of values of a tuple, checks whether a
given constraint x is respected, and returns a boolean value. Let S be a set of attribute names,
defined as: S = {a | a ∈ R(A) ∧ a is used in the formulation of x}, i.e., S ⊆ R(A). There is a violation
of business domain constraint in tuple t ∈ r iff: check(v(t,S)) = false.
Note: As defined in section 2.1, v(t,S) represents the values in tuple t of the attributes that belong to
S.
Example: The business domain constraint among attribute values: Total_Product = Quantity *
Sell_Price, does not hold for a given tuple of the Sales_Details relation.
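At this granularity, check receives several attribute values of the same tuple. A sketch instantiating it with the Total_Product = Quantity * Sell_Price constraint from the example:

```python
def row_constraint_violations(relation, check):
    """Return indices of tuples t where check(v(t,S)) is false."""
    return [i for i, t in enumerate(relation) if not check(t)]

# Constraint from the example: Total_Product = Quantity * Sell_Price.
sales_details = [
    {"Quantity": 2, "Sell_Price": 10.0, "Total_Product": 20.0},
    {"Quantity": 3, "Sell_Price": 5.0, "Total_Product": 14.0},
]
check = lambda t: t["Total_Product"] == t["Quantity"] * t["Sell_Price"]
print(row_constraint_violations(sales_details, check))  # [1]
```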
2.3 DQ Problems at the Level of a Single Relation
The DQ problems described in this section were identified by analyzing the values of multiple attributes
in multiple tuples of a relation, as illustrated in Figure 1.
Figure 4: Multiple attributes of a single tuple

Approximate duplicate tuples
Definition: Let S be a set of attribute names, defined as: S = {a | a ∈ R(A) ∧ a does not belong to
the primary key}, i.e., S ⊆ R(A). Let θ be a real number between 0 and 1. Let similarity be a
function that receives two values of an attribute, computes the similarity between them, and returns it
(also as a real number between 0 and 1). There are approximate duplicate tuples in relation r iff:
∃t1, t2 ∈ r, ∀a ∈ S : similarity(v(t1,a),v(t2,a)) ≥ θ ∧ t1 ≠ t2.
Example: The tuple Customer(10, ‘Smith Barney’, ‘Flowers Street, 123’, 502899106) is an
approximate duplicate of the tuple Customer(72, ‘S. Barney’, ‘Flowers St., 123’, 502899106).
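The definition leaves the similarity function abstract; as a stand-in, the standard library's difflib.SequenceMatcher ratio can be used. A sketch under that assumption, with an illustrative θ of 0.7 (a real implementation would use blocking to avoid the quadratic pairwise comparison):

```python
import difflib
from itertools import combinations

def similarity(v1, v2):
    """Similarity in [0, 1]; difflib's ratio stands in for the paper's function."""
    return difflib.SequenceMatcher(None, str(v1), str(v2)).ratio()

def approximate_duplicates(relation, attributes, theta=0.7):
    """Return index pairs (t1, t2) whose non-key attributes are all
    similar above the threshold θ."""
    return [(i, j) for (i, t1), (j, t2) in combinations(enumerate(relation), 2)
            if all(similarity(t1[a], t2[a]) >= theta for a in attributes)]

customers = [
    {"Id": 10, "Name": "Smith Barney", "Address": "Flowers Street, 123"},
    {"Id": 72, "Name": "S. Barney",    "Address": "Flowers St., 123"},
    {"Id": 31, "Name": "Ann Wood",     "Address": "Sun Street, 321"},
]
print(approximate_duplicates(customers, ["Name", "Address"]))  # [(0, 1)]
```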
Inconsistent duplicate tuples
Definition: Let S be a set of attribute names, defined as: S = {a | a ∈ R(A) ∧ a does not belong to
the primary key}, i.e., S ⊆ R(A). Let θ be a real number between 0 and 1. Let similarity be a
function that receives two values of an attribute, computes the similarity between them, and returns
it (also as a real number between 0 and 1). There are inconsistent duplicate tuples in relation r iff:
∃a2 ∈ S, ∃t1, t2 ∈ r, ∀a1 ∈ S\{a2} : similarity(v(t1,a1),v(t2,a1)) ≥ θ ∧ similarity(v(t1,a2),v(t2,a2)) < θ.
Example: The tuple Customer(10, ‘Smith Barney’, ‘Flowers Street, 123’, 502899106) is an
inconsistent duplicate of the tuple Customer(72, ‘Smith Barney’, ‘Sun Street, 321’, 502899106).
Violation of business domain constraint
Definition: Let check be a function that receives the attribute values of all tuples, checks whether a
given constraint x is respected, and returns a boolean value. Let S be a set of attribute names,
defined as: S = {a | a ∈ R(A) ∧ a is used in the formulation of x}, i.e., S ⊆ R(A). There is a violation
of business domain constraint in relation r iff: check(v(T,S)) = false.
Note: v(T,S) represents the values that the attributes belonging to S take for all tuples.
Example: The maximum number of product families allowed in relation Products_Families is 10,
but the existing number of families is 12.
2.4 DQ Problems at the Level of Multiple Relations
In this section, we present the DQ problems detected when analyzing the values from multiple relations,
as presented in Figure 1.
We assume there is a relationship among the relation schemas R1(A1) and R2(A2) of a data source DS. Let
S and T be sets of attribute names, defined as: S = {a | a ∈ R1(A1) ∧ a belongs to the foreign key that
establishes the relationship with R2(A2)}, i.e., S ⊆ R1(A1), and T = {a | a ∈ R2(A2) ∧ a belongs to the
primary key}, i.e., T ⊆ R2(A2). These two sets are used in the following definitions.
Referential integrity violation
Definition: Let V be the set of values of the primary key attributes, defined as: V = {v(t,T) | t ∈ r2}.
There is a referential integrity violation among relations r1 and r2 iff: ∃t ∈ r1 : v(t,S) ∉ V.
Example: The attribute Customer_Zip_Code of the Customer relation contains the value 5100,
which does not exist in the Zip_Code relation.
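Detection collects the primary-key values of r2 into a set V and scans the foreign keys of r1 against it. A sketch using the zip-code example (single-attribute keys assumed for brevity):

```python
def referential_integrity_violations(r1, foreign_key, r2, primary_key):
    """Return indices of r1 tuples whose foreign-key value has no matching
    primary-key value in r2, i.e., v(t,S) is not in V."""
    valid = {t[primary_key] for t in r2}
    return [i for i, t in enumerate(r1) if t[foreign_key] not in valid]

customers = [{"Name": "Smith", "Customer_Zip_Code": 4000},
             {"Name": "Wood",  "Customer_Zip_Code": 5100}]
zip_codes = [{"Zip_Code": 4000, "City": "Porto"},
             {"Zip_Code": 4445, "City": "Alfena"}]
print(referential_integrity_violations(customers, "Customer_Zip_Code",
                                       zip_codes, "Zip_Code"))  # [1]
```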
Incorrect reference
Definition: Let V be the set of values of the primary key attributes, defined as: V = {v(t,T) | t ∈ r2}.
Let u(t,S) be the correct and updated value that was supposed to be in the foreign key S of tuple t of
relation r1. There is an incorrect reference among relations r1 and r2 iff: ∃t ∈ r1 : v(t,S) ∈ V ∧ v(t,S)
≠ u(t,S).
Example: The attribute Customer_Zip_Code of the Customer relation contains the value 4415,
instead of 4445; both zip codes exist in the Zip_Code relation.
Heterogeneity of syntaxes
Definition: Let G(a) be the syntax of attribute a, given by a grammar or a regular expression. There
is a heterogeneity of syntaxes among relations r1 and r2 iff: ∃a1 ∈ R1(A1), ∃a2 ∈ R2(A2) : type(a1) =
type(a2) ∧ G(a1) ≠ G(a2).
Example: The attribute Order_Date of relation Orders has the syntax dd/mm/yyyy, while the
attribute Invoice_Date of relation Invoices has the syntax yyyy/mm/dd.
Circularity among tuples in a self-relationship
Definition: Let U be a set of attribute names, defined as: U = {a | a ∈ R1(A1) ∧ a belongs to the
primary key}, i.e., U ⊆ R1(A1). Let V be the set that contains the primary key values of all existing
tuples in r1, defined as: V = {v(t,U) | t ∈ r1}. Let v be the value of a primary key: v ∈ V. Let W be
the set that, starting from the tuple identified by the primary key v, contains the foreign key values
of all other tuples related with it, defined as: W = {v(t1,S) | v(t1,S) = v(t2,U) ∧ t1, t2 ∈ r1}. There is a
circularity among tuples in a self-relationship in relation r1 iff: v ∈ W.
Example: A product may be a sub-product of another product, and this information is stored in the
attribute Sub-product_Cod of the product. In relation Products, there exists the information that
product X (Product_Cod = 'X') is a sub-product of Y (Sub-product_Cod = 'Y') and, simultaneously,
that product Y (Product_Cod = 'Y') is a sub-product of X (Sub-product_Cod = 'X'); this is an
impossible situation.
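Operationally, this amounts to cycle detection over the foreign-key links of the self-relationship. A sketch that follows the links from each tuple (single-attribute keys assumed, as in the product example):

```python
def circular_tuples(relation, primary_key, foreign_key):
    """Detect cycles in a self-relationship by following foreign-key links
    from each tuple; returns the primary keys involved in a cycle."""
    links = {t[primary_key]: t[foreign_key] for t in relation}
    in_cycle = set()
    for start in links:
        seen, current = set(), start
        while current in links and current not in seen:
            seen.add(current)
            current = links[current]
        if current == start and start in seen:  # walked back to the start
            in_cycle.add(start)
    return in_cycle

products = [{"Product_Cod": "X", "Sub-product_Cod": "Y"},
            {"Product_Cod": "Y", "Sub-product_Cod": "X"},
            {"Product_Cod": "Z", "Sub-product_Cod": None}]
print(sorted(circular_tuples(products, "Product_Cod", "Sub-product_Cod")))  # ['X', 'Y']
```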
Violation of business domain constraint
Definition: Let check be a function that receives the attribute values of the tuples from relations r1
and r2, checks whether a given constraint x is respected, and returns a boolean value. Let U and V
be sets of attribute names, defined as: U = {a | a ∈ R1(A1) ∧ a is used in the formulation of x}, i.e.,
U ⊆ R1(A1), and V = {a | a ∈ R2(A2) ∧ a is used in the formulation of x}, i.e., V ⊆ R2(A2). Let W and
Z be sets of attribute values of related tuples from each relation, defined as: W = {v(t1,U) | v(t1,S) =
v(t2,T) ∧ t1 ∈ r1 ∧ t2 ∈ r2} and Z = {v(t2,V) | v(t2,T) = v(t1,S) ∧ t1 ∈ r1 ∧ t2 ∈ r2}. There is a violation
of business domain constraint among relations r1 and r2 iff: check(W, Z) = false.
Example: The attribute Invoice_Total of a tuple of the relation Invoices contains the value 100,
while the sum of the values of attribute Product_Value (for each product of the invoice) of the
relation Invoices_Details is only equal to 90 (instead of 100).
2.5 DQ Problems at the Level of Multiple Data Sources
The DQ problems presented below were identified by analyzing the values of multiple data sources, as
illustrated in Figure 1. As mentioned in Section 1, this paper only addresses the DQ problems related to
the instances (values) of data, i.e., the extensional level [6]. There are other kinds of DQ problems that
occur at the intensional level, i.e., problems related to the structure of data [6], also known as problems
among data schemas. For the reader interested in these problems, we suggest the work of Kashyap and
Sheth [5].
In this section, we assume that the relation schemas R1(A1) and R2(A2) belong to two different data
sources, respectively, DS1 and DS2. Both schemas concern the same real-world entity (e.g., customers).
We also assume that relation schema heterogeneities among DS1 and DS2 are solved, i.e., two attributes
referring to the same real-world property (e.g., unitary price) have the same name. However, the number
of attributes used in each data schema may be different.
Heterogeneity of syntaxes
Definition: Let G(a) be the syntax of attribute a, given by a grammar or a regular expression. There
is heterogeneity of syntaxes among relations r1 and r2 iff: ∃a1 ∈ R1(A1), ∃a2 ∈ R2(A2) : a1 = a2 ∧
type(a1) = type(a2) ∧ G(a1) ≠ G(a2).
Example: The attribute Insertion_Date of relation Customers from DS1 has the syntax dd/mm/yyyy,
while the attribute Insertion_Date of relation Customers from DS2, has the syntax yyyy/mm/dd.
Heterogeneity of measure units
Definition: Let S be the set of attribute names common to both relations, defined as: S = {a | a ∈
R1(A1) ∧ a ∈ R2(A2) ∧ type(a) = numeric}. Let k be a numeric constant value. There is
heterogeneity of measure units in attribute a of relations r1 and r2 iff: ∃a ∈ S, ∃t1 ∈ r1, ∃t2 ∈ r2 :
v(t1,a) = k * v(t2,a) ∧ k > 0 ∧ k ≠ 1.
Example: The attribute Product_Sell_Price is represented in euros in DS1, while in DS2 it is
represented in dollars.
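One way to surface a unit mismatch is to join matching tuples from the two sources and test whether the attribute values differ by a constant factor k ≠ 1. A sketch under the assumptions that tuples can be joined on a shared key and that values are exact:

```python
def unit_heterogeneity(r1, r2, attribute, key, tolerance=1e-9):
    """If matching tuples (joined on `key`) differ in `attribute` by a constant
    factor k != 1, return k (suggesting different measure units), else None."""
    pairs = [(t1[attribute], t2[attribute])
             for t1 in r1 for t2 in r2 if t1[key] == t2[key]]
    if not pairs:
        return None
    k = pairs[0][0] / pairs[0][1]
    constant = all(abs(v1 - k * v2) <= tolerance for v1, v2 in pairs)
    return k if constant and abs(k - 1) > tolerance else None

ds1 = [{"Product": "A", "Product_Sell_Price": 10.0},
       {"Product": "B", "Product_Sell_Price": 20.0}]
ds2 = [{"Product": "A", "Product_Sell_Price": 12.5},   # same prices in dollars
       {"Product": "B", "Product_Sell_Price": 25.0}]
print(unit_heterogeneity(ds1, ds2, "Product_Sell_Price", "Product"))  # 0.8
```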
Heterogeneity of representation
Definition: Let S be the set of attribute names common to both relations, defined as: S = R1(A1) ∩
R2(A2). Let translate be a function that receives an attribute value from a relation, looks up the
corresponding value in the other relation in a dictionary (lookup table), and returns it. There is
heterogeneity of representation in attribute a among relations r1 and r2 iff: ∃a ∈ S, ∃t1 ∈ r1, ∃t2 ∈ r2 :
v(t1,a) ≠ v(t2,a) ∧ translate(v(t1,a)) = v(t2,a).
Example: To represent the attribute Gender, the values F and M are used in DS1, while in DS2 the
values 0 and 1 are used.
Existence of synonyms
Definition: Let S be the set of attribute names common to both relations, defined as: S = {a | a ∈
R1(A1) ∧ a ∈ R2(A2) ∧ type(a) = string}. Let meaning be a function that receives a word, looks up
its meaning in a dictionary, and returns it. There are synonyms in attribute a among relations r1
and r2 iff: ∃a ∈ S, ∃t1 ∈ r1, ∃t2 ∈ r2 : meaning(v(t1,a)) = meaning(v(t2,a)) ∧ v(t1,a) ≠ v(t2,a).
Example: The relation Occupations of DS1 contains a tuple with Professor, while the equivalent
relation in DS2 contains a tuple with Teacher; both represent the same occupation.
Existence of homonyms
Definition: Let S be the set of attribute names common to both relations, defined as: S = {a | a ∈
R1(A1) ∧ a ∈ R2(A2) ∧ type(a) = string}. Let meaning1 and meaning2 be functions that receive a
word, look up its meaning in a dictionary (in the context of DS1 or DS2, respectively), and return it.
There are homonyms in attribute a among relations r1 and r2 iff: ∃a ∈ S, ∃t1 ∈ r1, ∃t2 ∈ r2 :
meaning1(v(t1,a)) ≠ meaning2(v(t2,a)) ∧ v(t1,a) = v(t2,a).
Example: In relation Products of DS1, there exists a product named Mouse (a branch of a company
that sells computer hardware), while in relation Products of DS2, there also exists a product named
Mouse (another branch of the company sells domestic animals, so the products here are the animals
themselves).
Approximate duplicate tuples
Definition: Let S be a set of attribute names, defined as: S = {a | a ∈ R(A) ∧ a does not belong to
the primary key}, i.e., S ⊆ R(A). Let θ be a real number between 0 and 1. Let similarity be a
function that receives two values of an attribute, computes the similarity between them, and returns
it (also as a real number between 0 and 1). There are approximate duplicate tuples among relations
r1 and r2 iff: ∃t1 ∈ r1, ∃t2 ∈ r2, ∀a ∈ S : similarity(v(t1,a),v(t2,a)) ≥ θ.
Example: The tuple Customer(10, ‘Smith Barney’, ‘Flowers Street, 123’, 502899106) in DS1 is
an approximate duplicate of the tuple Customer(27, ‘Smith B.’, ‘Flowers St., 123’, 502899106) in
DS2.
Inconsistent duplicate tuples
Definition: Let S be a set of attribute names, defined as: S = {a | a ∈ R(A) ∧ a does not belong to
the primary key}, i.e., S ⊆ R(A). Let θ be a real number between 0 and 1. Let similarity be a
function that receives two values of an attribute, computes the similarity between them, and returns
it (also as a real number between 0 and 1). There are inconsistent duplicate tuples among relations
r1 and r2 iff: ∃t1 ∈ r1, ∃t2 ∈ r2, ∃a2 ∈ S, ∀a1 ∈ S\{a2} : similarity(v(t1,a1),v(t2,a1)) ≥ θ ∧
similarity(v(t1,a2),v(t2,a2)) < θ.
Example: The tuple Customer(10, ‘Smith Barney’, ‘Flowers Street, 123’, 502899106) in DS1 is an
inconsistent duplicate of the tuple Customer(27, ‘Smith Barney’, ‘Sun Street, 321’, 502899106) in
DS2.
Violation of business domain constraint
Definition: Let check be a function that receives the attribute values of the tuples from relations r1
and r2 (of DS1 and DS2), checks whether a given constraint x is respected, and returns a boolean
value. Let S be the set of attribute names common to both relations, defined as: S = {a | a ∈ R1(A1)
∧ a ∈ R2(A2) ∧ a is used in the formulation of x}. Let T and U be sets that contain the attribute
values of all tuples from relations r1 and r2, defined as: T = {v(t1,S) | t1 ∈ r1} and U = {v(t2,S) | t2 ∈
r2}. There is a violation of business domain constraint among relations r1 and r2 iff: check(T, U) =
false.
Example: The maximum number of product families allowed is 10; the relation Product_Families
in DS1 contains 7 families, and the relation Product_Families in DS2 contains 8 families; the
number of distinct product families resulting from the integration (union) of both sources is 11; this
number violates the constraint.
2.6 Summary
Table 1 presents a summary of our taxonomy of DQ problems. The problems and the corresponding
granularity levels where they occur are shown in the table.
3. RELATED WORK
Kim et al. [7] present a quite complete taxonomy of DQ problems, describing the logic behind its
structure. They adopt a successive hierarchical refinement approach. The taxonomy is based on the
premise that DQ problems manifest in three different ways: missing data; not missing but wrong data; and
not missing and not wrong but unusable. Unusable data occurs when two or more databases are integrated
or representation standards are not consistently used when entering data. The taxonomy is a hierarchical
decomposition of these three basic manifestations of DQ problems. Considering the approach used and
the DQ problems identified, this taxonomy is the closest to ours.
Table 1: DQ problems organized by granularity level

| Data Quality Problem | Attrib. | Column | Row | Single Relation | Multiple Relations | Mult. Data Sources |
|---|---|---|---|---|---|---|
| Missing value | x | | | | | |
| Syntax violation | x | | | | | |
| Incorrect value | x | | | | | |
| Domain violation | x | | | | | |
| Invalid substring | x | | | | | |
| Misspelling error | x | | | | | |
| Imprecise value | x | | | | | |
| Violation of business domain constraint | x | x | x | x | x | x |
| Unique value violation | | x | | | | |
| Existence of synonyms | | x | | | | x |
| Semi-empty tuple | | | x | | | |
| Violation of functional dependency | | | x | | | |
| Approximate duplicate tuples | | | | x | | x |
| Inconsistent duplicate tuples | | | | x | | x |
| Referential integrity violation | | | | | x | |
| Incorrect reference | | | | | x | |
| Heterogeneity of syntaxes | | | | | x | x |
| Circularity among tuples in a self-relationship | | | | | x | |
| Heterogeneity of measure units | | | | | | x |
| Heterogeneity of representation | | | | | | x |
| Existence of homonyms | | | | | | x |

(The Attrib., Column, and Row columns are the three sub-levels of the Attribute/Tuple granularity level.)
Müller and Freytag [9] roughly classify DQ problems into syntactical, semantic, and coverage anomalies. Syntactical anomalies concern the syntax and values used to represent entities (e.g., lexical errors, domain syntax errors). Semantic anomalies hinder the data collection from being a comprehensive and non-redundant representation of the real world (e.g., duplicates, contradictions). Coverage anomalies are related to the number of real-world entities and entity properties actually stored in the data collection (e.g., missing values). This work is limited to DQ problems that occur in a single relation of a single source, so important DQ problems are not covered.
Rahm and Do [14] distinguish between single-source and multi-source problems, as we do. However, at the single-source level they do not divide the problems into those that occur in a single relation and those that result from relationships among multiple relations. Single-source and multi-source problems are divided into schema-related and instance-related problems. Schema-related problems are those that can be addressed by improving schema design, schema translation, and schema integration. Instance-related problems correspond to errors and inconsistencies in the actual data contents that cannot be prevented at the schema level. As mentioned in the introduction, we are only concerned with the DQ problems related to the instances of the data, so we do not make this separation. For single-source problems, both schema-related and instance-related, they distinguish between the following scopes: attribute, record, record type, and source. This is similar to the organization that we present in our taxonomy.
Even though the term used may differ (e.g., enterprise constraint violation; business rule violation), the DQ problem violation of business domain constraint included in our taxonomy is mentioned in almost every book about databases (e.g., [2, 16]). Surprisingly, however, it is not included in any of the taxonomies analysed. The other new DQ problems introduced by our taxonomy are: (i) semi-empty tuple; (ii) heterogeneity of syntaxes (at the levels of multiple relations and multiple data sources); and (iii) circularity among tuples in a self-relationship. All the problems identified in the three taxonomies are also covered by ours, although the names used to label the problems sometimes differ. Finally, in those taxonomies the DQ problems are described only through textual descriptions, while we present a rigorous definition for each problem.
4. CONCLUSION
This paper has presented our taxonomy of DQ problems. The taxonomy results from the research conducted to identify DQ problems at each granularity level of the usual data organization model. The study followed a bottom-up approach, from the lowest granularity level (attribute/tuple) to the highest (multiple data sources) at which DQ problems may appear, and the taxonomy was presented in the paper in the same order. Six groups of related DQ problems were derived from the four granularity levels. Since the approach followed to identify the problems was exhaustive and systematic, we are confident that no problem is missing.
The DQ problems included in our taxonomy were specified through rigorous definitions. This feature distinguishes our taxonomy from the related ones, which only use text to describe the problems. We believe that giving DQ problems a formal framework is a valuable contribution, since: (a) it is the only way to ensure a clear and precise definition of each DQ problem; and (b) it specifies what is required to detect the problem automatically, i.e.: (i) the metadata knowledge needed; (ii) the mathematical expression that defines the DQ problem, which can be seen as a logical procedure (rule) to detect it; and (iii) eventually, the function required to perform some transformation. These elements are explicitly included in the definition of each DQ problem.
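The three elements of each formal definition map naturally onto a detection rule. The sketch below shows one possible shape for such a rule, applied to the missing value problem; the class, field names, and the `phone` attribute are illustrative assumptions, not the paper's notation.

```python
# Sketch of how a formal DQ definition maps to an executable rule:
# (i) the metadata knowledge needed, (ii) a predicate that flags the
# problem, and (iii) an optional transformation. Names are illustrative.

from dataclasses import dataclass
from typing import Any, Callable, Optional

@dataclass
class DQProblem:
    name: str
    metadata: dict                                   # (i) metadata needed
    detect: Callable[[Any, dict], bool]              # (ii) rule: True = problem present
    transform: Optional[Callable[[Any, dict], Any]] = None  # (iii) optional fix

# Example instance: "missing value" on a mandatory attribute.
missing_value = DQProblem(
    name="missing value",
    metadata={"attribute": "phone", "mandatory": True},
    detect=lambda value, md: md["mandatory"] and value in (None, ""),
)

print(missing_value.detect(None, missing_value.metadata))   # True: problem detected
print(missing_value.detect("+351", missing_value.metadata)) # False
```

A detection tool could then iterate over a catalogue of such rule objects, one per problem in the taxonomy, evaluating each predicate against the data and its metadata.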
This work is a first step towards the development of a tool that automatically detects DQ problems. We now know and understand the entire set of DQ problems that may affect data, what is needed to detect each one, and how that can be translated into a computational method. All these elements need to be organized into the DQ tool architecture, which is our next task; afterwards, we intend to start developing the tool. We believe it will complement the limited detection capabilities currently offered by commercial data profiling tools.
ACKNOWLEDGMENTS
We would like to thank Helena Galhardas for the fruitful discussions and useful comments that helped us
to improve the contents of the paper.
REFERENCES
[1] Atzeni, P. and Antonellis, V. – Relational Database Theory. The Benjamin/Cummings Publishing Company,
Inc., 1983.
[2] Connolly, T. and Begg, C. – Database Systems: A Practical Approach to Design, Implementation and
Management. Addison Wesley Longman Limited, 1999. ISBN 0-201-34287-1.
[3] Dasu, T.; Vesonder, G. T. and Wright, J. R. – “Data Quality through Knowledge Engineering”. In Proceedings
of the SIGKDD'03 Conference, Washington. August 2003. pp. 705-710.
[4] Galhardas, H.; Florescu, D.; Shasha, D.; Simon, E. and Saita, C.-A. – “Data Cleaning: Language, Model, and
Algorithms”. In Proceedings of the Very Large Databases Conference (VLDB). 2001.
[5] Kashyap, V. and Sheth, A. – “Schematic and Semantic Similarities Between Database Objects: a Context-
Based Approach”. Very Large Databases Journal, 5 (4). 1996. pp. 276–304.
[6] Kedad, Z. and Métais, E. – “Ontology-Based Data Cleaning”. Lecture Notes in Computer Science, 2553, 2002. pp. 137-149.
[7] Kim, W.; Choi, B.-J.; Hong, E.-K.; Kim, S.-K. and Lee, D. – “A Taxonomy of Dirty Data”. Data Mining and
Knowledge Discovery, 7. 2003. pp. 81-99.
[8] Meta Group – Data Warehouse Scorecard. Meta Group, 1999.
[9] Müller, H. and Freytag, J.-C. – “Problems, Methods, and Challenges in Comprehensive Data Cleansing”.
Technical Report HUB-IB-164, Humboldt University, Berlin, 2003.
[10] Oliveira, P.; Rodrigues, F. and Henriques, P. – “Limpeza de Dados: Uma Visão Geral”. In Belo, O.; Lourenço,
A. and Alves, R. (Eds.) – Proceedings of Data Gadgets 2004 Workshop – Bringing Up Emerging Solutions for
Data Warehousing Systems (in conjunction with JISBD’04), Málaga, Spain, November 2004. pp. 39-51 (in
Portuguese).
[11] Oliveira, P.; Rodrigues, F.; Henriques, P. and Galhardas, H. – “A Taxonomy of Data Quality Problems”. In
Proceedings of the 2nd International Workshop on Data and Information Quality (in conjunction with
CAiSE’05), Porto, Portugal, June 2005.
[12] Orr, K. – “Data Quality and Systems Theory”. Communications of the ACM, 41 (2). 1998. pp. 66-71.
[13] Pipino, L.; Lee, Y. and Wang, R. – “Data Quality Assessment”. Communications of the ACM, 45 (4). 2002. pp.
211-218.
[14] Rahm, E. and Do, H. H. – “Data Cleaning: Problems and Current Approaches”. IEEE Bulletin of the Technical
Committee on Data Engineering, 24 (4). 2000.
[15] Sattler, K. and Schallehn, E. – “A Data Preparation Framework based on a Multidatabase Language”. In
Proceedings of International Database Engineering and Applications Symposium (IDEAS 2001), Grenoble,
France. IEEE Computer Society. 2001. pp. 219-228.
[16] Ullman, J. and Widom, J. – A First Course in Database Systems. Prentice-Hall, Inc., 1997. ISBN 0-13-861337-0.
[17] Wand, Y. and Wang, R. – “Anchoring Data Quality Dimensions in Ontological Foundations”. Communications
of the ACM, 39 (11). 1996. pp. 86-95.