A FORMAL DEFINITION OF DATA QUALITY PROBLEMS
(Completed paper)
Paulo Oliveira
DI/gEPL – Languages Specification and Processing Group
University of Minho (Portugal), and
GECAD/ISEP-IPP – Knowledge Engineering and Decision Support Group
Institute of Engineering – Polytechnic of Porto (Portugal)
pjo@isep.ipp.pt
Fátima Rodrigues
GECAD/ISEP-IPP - Knowledge Engineering and Decision Support Group
Institute of Engineering – Polytechnic of Porto (Portugal)
mfc@isep.ipp.pt
Pedro Henriques
DI/gEPL – Languages Specification and Processing Group
University of Minho (Portugal)
prh@di.uminho.pt
Abstract: The exploration of data to extract information or knowledge to support decision making
is a critical success factor for an organization in today's society. However, several problems can
affect data quality. These problems have a negative effect on the results extracted from data,
affecting their usefulness and correctness. In this context, it is quite important to know and
understand the data problems. This paper presents a taxonomy of data quality problems, organizing
them by granularity levels of occurrence. A formal definition is presented for each problem
included. The taxonomy provides rigorous definitions, which are richer in information than the
textual definitions used in previous works. These definitions are useful for the development of a
data quality tool that automatically detects the identified problems.
Key Words: Data Quality Problems, Formal Definition, Taxonomy
1. INTRODUCTION
Nowadays, public and private organizations understand the value of data. Data is a key asset to improve
efficiency in today’s dynamic and competitive business environment. However, as organizations begin to
create integrated data warehouses for decision support, the resulting Data Quality (DQ) problems become
painfully clear [12]. A study by the Meta Group revealed that 41% of the data warehouse projects fail,
mainly due to insufficient DQ, leading to wrong decisions [8]. The quality of the input data strongly
influences the quality of the results [15] (“garbage in, garbage out” principle).
The concept of DQ is vast, comprising different definitions and interpretations. DQ is essentially studied
in two research communities: databases and management. The first one studies DQ from a technical point
of view (e.g., [4]), while the second one is also concerned with other aspects or dimensions (e.g.,
accessibility, believability, relevancy, interpretability, objectivity) involved in DQ (e.g., [13, 17]). In the
context of this paper we follow the databases perspective, i.e., DQ means just the quality of the data
values or instances.
DQ problems are also labeled as errors, anomalies, or even dirtiness, and include, among others, missing
attribute values, incorrect attribute values, or different representations of the same data. It is not
uncommon for operational databases to have 60% to 90% of bad data [3]. These problems are an obstacle
to effective data usage and, as already said, negatively affect the results and conclusions obtained.
Therefore, before using an analysis-oriented tool, data must be examined to check whether the required
quality is assured. If not, DQ must be improved by removing or repairing any problems that may exist
[10].
Concerns about DQ problems arise in three different contexts [4]: (i) when one wants to correct anomalies in a
single data source, as files and databases (e.g., duplicate elimination in a file); (ii) when poorly structured
or unstructured data is migrated into structured data; or (iii) when one wants to integrate data coming
from multiple sources into a single new data source (e.g., data warehouse construction). In the last
situation, the problems are even more critical because distinct data sources frequently contain redundant
data under different representations. The representations must be consolidated and the duplicates removed
to provide an accurate and consistent access to data.
The paper reports the new developments of our work, initially presented in [11]. The main contributions
provided here are: (i) a formal definition for each DQ problem and (ii) a taxonomy of DQ problems that
organizes them by granularity levels of occurrence.
In [7, 9, 14] a comprehensive list of problems that affect DQ is presented. However, the problems are
only described through a textual definition. It is commonly accepted that natural language tends to be
ambiguous by nature. Therefore, doubts about the true meaning of some DQ problems arise, which means
that they need to be further clarified. Using a formal definition is a suitable approach to specify each DQ
problem in a rigorous way.
Besides being rigorous, this kind of definition is also useful because it carries more information than a
textual definition. The definition makes explicit: (i) that it concerns just a given data type (e.g., the
string data type); (ii) the metadata knowledge needed to detect a problem (e.g., the attribute domain);
(iii) the mathematical expression that specifies the DQ problem, which can be computationally translated
to automate its detection (just for illustration purposes, we show the definition of domain violation
presented later: ∃t ∈ r : v(t,a) ∉ Dom(a)); and (iv) possibly a required function that allows the DQ
problem to be detected (e.g., to detect a misspelling error, a spell checker function must be
available). A framework for
DQ problems is our first step towards the development of an automated tool for detecting the problems.
With this tool we intend to complement the capabilities of today’s data profiling tools.
We argue that a taxonomy of DQ problems is important because: (i) it is useful to understand how far a
given DQ tool is able to go in detecting and correcting DQ problems, i.e., it allows the coverage of a
DQ tool to be measured; and (ii) it guides research efforts, emphasizing the DQ problems that deserve
further attention, i.e., if a DQ problem has no detection or correction support, research attention
should be given to it.
The paper is organized as follows. Section 2 presents in detail our taxonomy, organized by granularity
levels of DQ problems. Section 3 compares our taxonomy to related work. Finally, in Section 4,
conclusions and some future work directions are described.
2. DATA QUALITY PROBLEMS
Figure 1 presents the well-known model of data organization: (i) data is stored in multiple data
sources; (ii) a data source is composed of several relations, and relationships are established among
them; (iii) a single relation is made up of several tuples; and (iv) a tuple is composed of a predefined
number of attributes. This model results in a hierarchy of four levels of data granularity: multiple
data sources; multiple relations; single relation; and attribute/tuple.
We have identified the DQ problems and created the taxonomy based on this model. Using real-world
data from the retail sector, we thoroughly analyzed each granularity level, from the lowest (attribute/
tuple) to the highest (multiple data sources), to detect specific DQ problems. The purpose was to tackle
first the most specific and easier to detect DQ problems and leave to the end the most generic and difficult
ones. The analysis was based on the fundamental elements of this model of data organization (e.g., data
types, relationships, data representation structure). The systematic approach used supports our conviction
that the taxonomy is complete.
The DQ problems identified are presented through a rigorous definition and properly illustrated with an
example. Our taxonomy covers the problems that affect data represented in a tabular format, i.e., at least
in the first normal form. The types of data considered are: numeric, date/time, enumerated, and string.
Multimedia data was excluded, since it requires a special kind of manipulation.
2.1 Preliminaries
We start by introducing the notation used throughout the paper, following [1]. A relation schema consists
of a name R (the relation name), along with a list A = a1,a2,...,an of distinct attribute names, represented by
R(a1,a2,...,an) or simply by R(A). The integer n represents the relation schema degree. A data source DS is
composed of a set of m relation schemas R1(A1), R2(A2),...,Rm(Am). A domain d is a set of atomic values.
Given a set D of domains and a set A of attribute names, we assume there is a function Dom: A → D that
associates attributes with their domains. The function Dom is applied to the set of attribute names, which
is represented by Dom(A) = Dom(a1,a2,...,an) = Dom(a1) × Dom(a2) × ... × Dom(an).

Figure 1: Typical model of data organization

A relation instance (or relation, for short) is a finite set r ⊆ Dom(a1) × Dom(a2) × ... × Dom(an) and is
represented by r(a1,a2,...,an) or simply by r(A). Each element of r is called a tuple and is represented by t.
A tuple t can be viewed as a function that associates a value of Dom(a1) with a1, a value of Dom(a2) with a2,
..., and a value of Dom(an) with an. The value of attribute a in tuple t is represented by v(t,a). The values
in tuple t of attributes a1,a2,...,an, i.e., v(t,a1),v(t,a2),...,v(t,an), are denoted by v(t,A). By analogy,
the values attribute a takes for all tuples are represented by v(T,a). The application of a function f over a
value v is represented by f(v), while over a set of values V it is represented by f(V). If, for some reason, a
value transformation cannot be done, the function f acts as the identity function, i.e., f(v) = v or f(V) = V.
The data type of attribute a is denoted by type(a).
2.2 DQ Problems at the Level of Attribute/Tuple
This level is divided into three groups of DQ problems that were encountered by analyzing the value(s)
of: (i) a single attribute of a single tuple; (ii) a single attribute in multiple tuples (a column); and (iii)
multiple attributes of a single tuple (a row).
2.2.1 Single Attribute of a Single Tuple
The following DQ problems were detected by analyzing the values of single attributes (with different data
types) in single tuples (as presented in Figure 2).
Missing value
Definition: Let S be a set of attribute names, defined as: S = {a | a ∈ R(A) ∧ a is a mandatory
attribute}, i.e., S ⊆ R(A). There is a missing value in attribute a ∈ S if and only if (iff): ∃t ∈ r :
v(t,a) = null.
Example: Absence of value in the mandatory attribute Name of a customer.
Note: We do not consider the absence of value in an optional attribute to be a DQ problem.
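As a minimal sketch of how this definition can be checked mechanically, the following Python function scans a relation for nulls in mandatory attributes. The list-of-dicts representation (one dict per tuple, None playing the role of null) and the attribute names are illustrative assumptions, not from the paper:

```python
def missing_values(relation, mandatory):
    """Return (tuple index, attribute) pairs where a mandatory attribute is null.

    Optional attributes are ignored, matching the paper's note that an empty
    optional attribute is not a DQ problem.
    """
    return [(i, a) for i, t in enumerate(relation)
            for a in mandatory if t.get(a) is None]

customers = [
    {"Id": 1, "Name": "John Taylor", "Phone": None},   # Phone is optional here
    {"Id": 2, "Name": None, "Phone": "555-0101"},      # Name is mandatory
]
print(missing_values(customers, mandatory={"Id", "Name"}))  # [(1, 'Name')]
```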
Syntax violation
Definition: Let G(a) be the syntax of attribute a, given by a grammar or a regular expression. Let
L(G(a)) be the language generated by the grammar or regular expression. There is a syntax violation
in attribute a ∈ R(A) iff: ∃t ∈ r : v(t,a) ∉ L(G(a)).
Example: The attribute Order_Date contains the value 13/12/2004, instead of 2004/12/13.
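When G(a) is given as a regular expression, membership in L(G(a)) is a full-match test. A sketch under the same list-of-dicts assumption, with an illustrative yyyy/mm/dd pattern:

```python
import re

def syntax_violations(relation, attribute, pattern):
    """Return indices of tuples whose value of `attribute` is not in L(G(a)),
    where G(a) is expressed as the regular expression `pattern`."""
    syntax = re.compile(pattern)
    return [i for i, t in enumerate(relation)
            if t[attribute] is not None and not syntax.fullmatch(t[attribute])]

orders = [{"Order_Date": "2004/12/13"}, {"Order_Date": "13/12/2004"}]
print(syntax_violations(orders, "Order_Date", r"\d{4}/\d{2}/\d{2}"))  # [1]
```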
Incorrect value
Definition: Let u(t,a) be the correct and updated value that the attribute a of tuple t was supposed
to have. There is an incorrect value in attribute a ∈ R(A) iff: ∃t ∈ r : v(t,a) ∈ Dom(a) ∧ v(t,a) ≠
u(t,a).
Figure 2: A single attribute of a single tuple
Example: The attribute Creation_Date contains the value 23/09/2003, instead of 23/09/2004.
Domain violation
Definition: There is a domain violation in attribute a ∈ R(A) iff: ∃t ∈ r : v(t,a) ∉ Dom(a).
Example: In a given order, the attribute Ordered_Quantity contains a negative value.
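Encoding Dom(a) as a membership predicate makes the detection expression directly executable. A sketch with an assumed non-negativity domain for the quantity example:

```python
def domain_violations(relation, attribute, in_domain):
    """Return indices of tuples where v(t,a) ∉ Dom(a); `in_domain` is a
    predicate encoding membership in the attribute domain."""
    return [i for i, t in enumerate(relation) if not in_domain(t[attribute])]

order_lines = [{"Ordered_Quantity": 3}, {"Ordered_Quantity": -2}]
print(domain_violations(order_lines, "Ordered_Quantity", lambda q: q >= 0))  # [1]
```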
Violation of business domain constraint
Definition: Let check be a function that receives an attribute value, checks whether it respects a
given constraint, and returns a boolean value. There is a violation of business domain constraint in
attribute a ∈ R(A) iff: ∃t ∈ r : check(v(t,a)) = false.
Example: The attribute Name of a customer must have at least two words; however, in a certain
tuple this constraint is not respected.
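The check function is deliberately left abstract by the definition; a sketch instantiating it with the two-word constraint from the example (the relation layout is, as before, an assumption):

```python
def constraint_violations(relation, attribute, check):
    """Return indices of tuples where check(v(t,a)) is false."""
    return [i for i, t in enumerate(relation) if not check(t[attribute])]

# Constraint from the example: a customer name must have at least two words.
customers = [{"Name": "John Taylor"}, {"Name": "Madonna"}]
print(constraint_violations(customers, "Name", lambda v: len(v.split()) >= 2))  # [1]
```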
A domain violation in an attribute whose data type is string may be further detailed as presented next.
Let S be a set of attribute names, defined as: S = {a | a ∈ R(A) ∧ type(a) = string}, i.e., S ⊆ R(A). This set
is used in the following definitions.
Invalid substring
Definition: Let v'(t,a) be a substring of v(t,a). There is an invalid substring in attribute a ∈ S iff: ∃t
∈ r : v(t,a) ∉ Dom(a) ∧ v'(t,a) ∈ Dom(a).
Example: The attribute Customer_Name also stores the academic degree (e.g., Dr. John Taylor).
Misspelling error
Definition: Let spell be a spelling checker function that receives a misspelled word, looks up the
correct word in a language dictionary, and returns it. There is a misspelling error in attribute a ∈ S
iff: ∃t ∈ r : v(t,a) ∉ Dom(a) ∧ spell(v(t,a)) ∈ Dom(a).
Example: The attribute Address_Place contains the value Sant Louis, instead of Saint Louis.
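The definition only requires that some spell function exists; as a stand-in, the standard library's difflib.get_close_matches can play that role against the attribute domain. A sketch under that assumption:

```python
import difflib

def misspelling_errors(relation, attribute, domain):
    """Flag values outside Dom(a) whose closest domain entry suggests a
    misspelling; get_close_matches stands in for the paper's spell()."""
    errors = []
    for i, t in enumerate(relation):
        value = t[attribute]
        if value not in domain:
            suggestion = difflib.get_close_matches(value, domain, n=1)
            if suggestion:  # spell(v(t,a)) is in Dom(a)
                errors.append((i, value, suggestion[0]))
    return errors

places = [{"Address_Place": "Saint Louis"}, {"Address_Place": "Sant Louis"}]
print(misspelling_errors(places, "Address_Place", {"Saint Louis", "Springfield"}))
# [(1, 'Sant Louis', 'Saint Louis')]
```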
Imprecise value
Definition: Let translate be a function that receives an abbreviation or an acronym, looks up its
meaning (in full words) in a dictionary (lookup table), and returns it. There is an imprecise value in
attribute a ∈ S iff: ∃t ∈ r : v(t,a) ∉ Dom(a) ∧ translate(v(t,a)) ∈ Dom(a).
Example: The value Ant. in attribute Customer_Contact may represent Anthony, Antonia, etc.
2.2.2 Single Attribute in Multiple Tuples
The following DQ problems were identified by analyzing the values of a single attribute in multiple
tuples, as illustrated in Figure 3.
Unique value violation
Definition: Let S be a set of attribute names, defined as: S = {a | a ∈ R(A) ∧ a is a unique value
attribute}, i.e., S ⊆ R(A). There is a unique value violation in attribute a ∈ S iff: ∃t1, t2 ∈ r : v(t1,a)
= v(t2,a) ∧ t1 ≠ t2.
Example: Two different customers have the same taxpayer identification number.
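Detection amounts to counting occurrences of each value in the column. A sketch with an illustrative taxpayer-number attribute:

```python
from collections import Counter

def unique_value_violations(relation, attribute):
    """Return the values of a supposedly unique attribute that occur in more
    than one tuple (t1 ≠ t2 with v(t1,a) = v(t2,a))."""
    counts = Counter(t[attribute] for t in relation)
    return sorted(v for v, n in counts.items() if n > 1)

customers = [
    {"Id": 1, "Taxpayer_Id": 502899106},
    {"Id": 2, "Taxpayer_Id": 502899106},
    {"Id": 3, "Taxpayer_Id": 501000000},
]
print(unique_value_violations(customers, "Taxpayer_Id"))  # [502899106]
```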
Existence of synonyms
Definition: Let S be a set of attribute names, defined as: S = {a | a ∈ R(A) ∧ type(a) = string}, i.e.,
S ⊆ R(A). Let meaning be a function that receives a word, looks up its meaning in a dictionary,
and returns it. There are synonyms in attribute a ∈ S iff: ∃t1, t2 ∈ r : v(t1,a) ≠ v(t2,a) ∧
meaning(v(t1,a)) = meaning(v(t2,a)).
Example: The attribute Occupation contains the values Professor and Teacher in different tuples,
which in fact represent the same occupation.
Violation of business domain constraint
Definition: Let check be a function that receives the set of all values of an attribute, checks whether
a given constraint is respected, and returns a boolean value. There is a violation of business domain
constraint in attribute a ∈ R(A) iff: check(v(T,a)) = false.
Note: As defined in Section 2.1, v(T,a) represents the values that attribute a takes for all tuples.
Example: The values of attribute Invoice_Date must appear in the relation in ascending order, but
this does not happen.
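Here check receives the whole column v(T,a) at once. A sketch instantiating it with the ascending-order constraint from the example (yyyy/mm/dd strings sort chronologically, an assumption of this illustration):

```python
def column_constraint_violated(relation, attribute, check):
    """Apply check() to the whole column v(T,a); True means a violation."""
    return not check([t[attribute] for t in relation])

def ascending(values):
    """Constraint from the example: values must appear in ascending order."""
    return all(x <= y for x, y in zip(values, values[1:]))

invoices = [{"Invoice_Date": "2004/01/10"},
            {"Invoice_Date": "2004/03/02"},
            {"Invoice_Date": "2004/02/15"}]
print(column_constraint_violated(invoices, "Invoice_Date", ascending))  # True
```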
2.2.3 Multiple Attributes of a Single Tuple
The following DQ problems were identified by analyzing the values of multiple attributes of a single
tuple, as illustrated in Figure 4.
Semi-empty tuple
Definition: Let θ be a user-defined threshold (a real number between 0 and 1), and S the set of
attribute names that are empty in tuple t, defined as: S = {a | a ∈ R(A) ∧ v(t,a) = null}, i.e., S ⊆
R(A). Let m be the cardinality of set S, defined as: m = |S|, and n be the relation schema degree. The
tuple t is a semi-empty tuple iff: m/n ≥ θ.
Figure 3: A single attribute in multiple tuples
Example: If 60% or more of the tuple attributes are empty, then the tuple is classified as semi-empty.
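The m/n ≥ θ test translates directly into code. A sketch with θ = 0.6 reproducing the 60% example (attribute names are illustrative):

```python
def semi_empty_tuples(relation, attributes, theta=0.6):
    """Return indices of tuples where the fraction of null attributes
    m/n reaches the threshold θ."""
    n = len(attributes)
    flagged = []
    for i, t in enumerate(relation):
        m = sum(1 for a in attributes if t.get(a) is None)
        if m / n >= theta:
            flagged.append(i)
    return flagged

attrs = ["Name", "Address", "Phone", "Email", "Fax"]
tuples = [
    {"Name": "John", "Address": "Flowers St.", "Phone": None, "Email": None, "Fax": None},
    {"Name": "Ann", "Address": None, "Phone": None, "Email": None, "Fax": None},
]
print(semi_empty_tuples(tuples, attrs))  # [0, 1] -- 3/5 and 4/5 empty
```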
Violation of functional dependency
Definition: Let a2 be an attribute whose value functionally depends on the values of other
attributes. The set of these attribute names is defined as: S = {a1 | a1, a2 ∈ R(A) : the value of a2
functionally depends on the value of a1}, i.e., S ⊆ R(A). Let value be a function that receives a set
of values of a tuple, computes the value of the functionally dependent attribute, and returns it. There
is a violation of functional dependency in tuple t iff: ∃t ∈ r : value(v(t,S)) ≠ v(t,a2).
Example: There is a functional dependency between Zip_Code and City: each value of the first
attribute must be associated with exactly one value of the second. Therefore, the following values
of two customer tuples violate the functional dependency: (Zip_Code = 4000; City = "Porto") and
(Zip_Code = 4000; City = "Lisboa").
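When no value function is available, a common detection strategy is to look for determinant values mapped to more than one dependent value. A sketch of that approach using the Zip_Code → City example:

```python
def fd_violations(relation, determinant, dependent):
    """Return determinant values that map to more than one dependent value,
    i.e., witnesses of a functional-dependency violation."""
    seen = {}
    violating = set()
    for t in relation:
        key, value = t[determinant], t[dependent]
        if seen.setdefault(key, value) != value:
            violating.add(key)
    return sorted(violating)

customers = [{"Zip_Code": 4000, "City": "Porto"},
             {"Zip_Code": 4000, "City": "Lisboa"},
             {"Zip_Code": 1000, "City": "Lisboa"}]
print(fd_violations(customers, "Zip_Code", "City"))  # [4000]
```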
Violation of business domain constraint
Definition: Let check be a function that receives the set of values of a tuple, checks whether a
given constraint x is respected, and returns a boolean value. Let S be a set of attribute names,
defined as: S = {a | a ∈ R(A) ∧ a is used in the formulation of x}, i.e., S ⊆ R(A). There is a violation
of business domain constraint in tuple t ∈ r iff: check(v(t,S)) = false.
Note: As defined in section 2.1, v(t,S) represents the values in tuple t of the attributes that belong to
S.
Example: The business domain constraint among attribute values: Total_Product = Quantity *
Sell_Price, does not hold for a given tuple of the Sales_Details relation.
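At this granularity, check receives several attribute values of the same tuple. A sketch instantiating it with the Total_Product = Quantity * Sell_Price constraint from the example:

```python
def row_constraint_violations(relation, check):
    """Return indices of tuples t where check(v(t,S)) is false."""
    return [i for i, t in enumerate(relation) if not check(t)]

# Constraint from the example: Total_Product = Quantity * Sell_Price.
sales_details = [
    {"Quantity": 2, "Sell_Price": 10.0, "Total_Product": 20.0},
    {"Quantity": 3, "Sell_Price": 5.0, "Total_Product": 14.0},
]
check = lambda t: t["Total_Product"] == t["Quantity"] * t["Sell_Price"]
print(row_constraint_violations(sales_details, check))  # [1]
```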
2.3 DQ Problems at the Level of a Single Relation
The DQ problems described in this section were identified by analyzing the values of multiple attributes
in multiple tuples of a relation, as illustrated in Figure 1.
Figure 4: Multiple attributes of a single tuple

Approximate duplicate tuples
Definition: Let S be a set of attribute names, defined as: S = {a | a ∈ R(A) ∧ a does not belong to
the primary key}, i.e., S ⊆ R(A). Let θ be a real number between 0 and 1. Let similarity be a
function that receives two values of an attribute, computes the similarity between them, and returns it
(also as a real number between 0 and 1). There are approximate duplicate tuples in relation r iff:
∃t1, t2 ∈ r, ∀a ∈ S : similarity(v(t1,a),v(t2,a)) ≥ θ ∧ t1 ≠ t2.
Example: The tuple Customer(10, ‘Smith Barney’, ‘Flowers Street, 123’, 502899106) is an
approximate duplicate of the tuple Customer(72, ‘S. Barney’, ‘Flowers St., 123’, 502899106).
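The definition leaves the similarity function abstract; as a stand-in, the standard library's difflib.SequenceMatcher ratio can be used. A sketch under that assumption, with an illustrative θ of 0.7 (a real implementation would use blocking to avoid the quadratic pairwise comparison):

```python
import difflib
from itertools import combinations

def similarity(v1, v2):
    """Similarity in [0, 1]; difflib's ratio stands in for the paper's function."""
    return difflib.SequenceMatcher(None, str(v1), str(v2)).ratio()

def approximate_duplicates(relation, attributes, theta=0.7):
    """Return index pairs (t1, t2) whose non-key attributes are all
    similar above the threshold θ."""
    return [(i, j) for (i, t1), (j, t2) in combinations(enumerate(relation), 2)
            if all(similarity(t1[a], t2[a]) >= theta for a in attributes)]

customers = [
    {"Id": 10, "Name": "Smith Barney", "Address": "Flowers Street, 123"},
    {"Id": 72, "Name": "S. Barney",    "Address": "Flowers St., 123"},
    {"Id": 31, "Name": "Ann Wood",     "Address": "Sun Street, 321"},
]
print(approximate_duplicates(customers, ["Name", "Address"]))  # [(0, 1)]
```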
Inconsistent duplicate tuples
Definition: Let S be a set of attribute names, defined as: S = {a | a ∈ R(A) ∧ a does not belong to
the primary key}, i.e., S ⊆ R(A). Let θ be a real number between 0 and 1. Let similarity be a
function that receives two values of an attribute, computes the similarity between them, and returns
it (also as a real number between 0 and 1). There are inconsistent duplicate tuples in relation r iff:
∃a2 ∈ S, ∃t1, t2 ∈ r, ∀a1 ∈ S\{a2} : similarity(v(t1,a1),v(t2,a1)) ≥ θ ∧ similarity(v(t1,a2),v(t2,a2)) < θ.
Example: The tuple Customer(10, ‘Smith Barney’, ‘Flowers Street, 123’, 502899106) is an
inconsistent duplicate of the tuple Customer(72, ‘Smith Barney’, ‘Sun Street, 321’, 502899106).
Violation of business domain constraint
Definition: Let check be a function that receives the attribute values of all tuples, checks whether a
given constraint x is respected, and returns a boolean value. Let S be a set of attribute names,
defined as: S = {a | a ∈ R(A) ∧ a is used in the formulation of x}, i.e., S ⊆ R(A). There is a violation
of business domain constraint in relation r iff: check(v(T,S)) = false.
Note: v(T,S) represents the values that the attributes belonging to S take for all tuples.
Example: The maximum number of product families allowed in relation Products_Families is 10,
but the existing number of families is 12.
2.4 DQ Problems at the Level of Multiple Relations
In this section, we present the DQ problems detected when analyzing the values from multiple relations,
as presented in Figure 1.
We assume there is a relationship among the relation schemas R1(A1) and R2(A2) of a data source DS. Let
S and T be sets of attribute names, defined as: S = {a | a ∈ R1(A1) ∧ a belongs to the foreign key that
establishes the relationship with R2(A2)}, i.e., S ⊆ R1(A1), and T = {a | a ∈ R2(A2) ∧ a belongs to the
primary key}, i.e., T ⊆ R2(A2). These two sets are used in the following definitions.
Referential integrity violation
Definition: Let V be the set of values of the primary key attributes, defined as: V = {v(t,T) | t ∈ r2}.
There is a referential integrity violation among relations r1 and r2 iff: ∃t ∈ r1 : v(t,S) ∉ V.
Example: The attribute Customer_Zip_Code of the Customer relation contains the value 5100,
which does not exist in the Zip_Code relation.
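Detection collects the primary-key values of r2 into a set V and scans the foreign keys of r1 against it. A sketch using the zip-code example (single-attribute keys assumed for brevity):

```python
def referential_integrity_violations(r1, foreign_key, r2, primary_key):
    """Return indices of r1 tuples whose foreign-key value has no matching
    primary-key value in r2, i.e., v(t,S) is not in V."""
    valid = {t[primary_key] for t in r2}
    return [i for i, t in enumerate(r1) if t[foreign_key] not in valid]

customers = [{"Name": "Smith", "Customer_Zip_Code": 4000},
             {"Name": "Wood",  "Customer_Zip_Code": 5100}]
zip_codes = [{"Zip_Code": 4000, "City": "Porto"},
             {"Zip_Code": 4445, "City": "Alfena"}]
print(referential_integrity_violations(customers, "Customer_Zip_Code",
                                       zip_codes, "Zip_Code"))  # [1]
```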
Incorrect reference
Definition: Let V be the set of values of the primary key attributes, defined as: V = {v(t,T) | t ∈ r2}.
Let u(t,S) be the correct and updated value that was supposed to be in the foreign key S of tuple t of
relation r1. There is an incorrect reference among relations r1 and r2 iff: ∃t ∈ r1 : v(t,S) ∈ V ∧ v(t,S)
≠ u(t,S).
Example: The attribute Customer_Zip_Code of the Customer relation contains the value 4415,
instead of 4445; both zip codes exist in the Zip_Code relation.
Heterogeneity of syntaxes
Definition: Let G(a) be the syntax of attribute a, given by a grammar or a regular expression. There
is a heterogeneity of syntaxes among relations r1 and r2 iff: ∃a1 ∈ R1(A1), ∃a2 ∈ R2(A2) : type(a1) =
type(a2) ∧ G(a1) ≠ G(a2).
Example: The attribute Order_Date of relation Orders has the syntax dd/mm/yyyy, while the
attribute Invoice_Date of relation Invoices has the syntax yyyy/mm/dd.
Circularity among tuples in a self-relationship
Definition: Let U be a set of attribute names, defined as: U = {a | a ∈ R1(A1) ∧ a belongs to the
primary key}, i.e., U ⊆ R1(A1). Let V be the set that contains the primary key values of all existing
tuples in r1, defined as: V = {v(t,U) | t ∈ r1}. Let v be the value of a primary key: v ∈ V. Let W be
the set that, starting from the tuple identified by the primary key v, contains the foreign key values
of all other tuples related with it, defined as: W = {v(t1,S) | v(t1,S) = v(t2,U) ∧ t1, t2 ∈ r1}. There is a
circularity among tuples in a self-relationship in relation r1 iff: v ∈ W.
Example: A product may be a sub-product of another product, and this information is stored in the
attribute Sub-product_Cod of the product. In relation Products, there exists the information that
product X (Product_Cod = 'X') is a sub-product of Y (Sub-product_Cod = 'Y') and, simultaneously,
that product Y (Product_Cod = 'Y') is a sub-product of X (Sub-product_Cod = 'X'); this is an
impossible situation.
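Operationally, this amounts to cycle detection over the foreign-key links of the self-relationship. A sketch that follows the links from each tuple (single-attribute keys assumed, as in the product example):

```python
def circular_tuples(relation, primary_key, foreign_key):
    """Detect cycles in a self-relationship by following foreign-key links
    from each tuple; returns the primary keys involved in a cycle."""
    links = {t[primary_key]: t[foreign_key] for t in relation}
    in_cycle = set()
    for start in links:
        seen, current = set(), start
        while current in links and current not in seen:
            seen.add(current)
            current = links[current]
        if current == start and start in seen:  # walked back to the start
            in_cycle.add(start)
    return in_cycle

products = [{"Product_Cod": "X", "Sub-product_Cod": "Y"},
            {"Product_Cod": "Y", "Sub-product_Cod": "X"},
            {"Product_Cod": "Z", "Sub-product_Cod": None}]
print(sorted(circular_tuples(products, "Product_Cod", "Sub-product_Cod")))  # ['X', 'Y']
```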
Violation of business domain constraint
Definition: Let check be a function that receives the attribute values of the tuples from relations r1
and r2, checks whether a given constraint x is respected, and returns a boolean value. Let U and V
be sets of attribute names, defined as: U = {a | a ∈ R1(A1) ∧ a is used in the formulation of x}, i.e.,
U ⊆ R1(A1), and V = {a | a ∈ R2(A2) ∧ a is used in the formulation of x}, i.e., V ⊆ R2(A2). Let W and
Z be sets of attribute values of related tuples from each relation, defined as: W = {v(t1,U) | v(t1,S) =
v(t2,T) ∧ t1 ∈ r1 ∧ t2 ∈ r2} and Z = {v(t2,V) | v(t2,T) = v(t1,S) ∧ t1 ∈ r1 ∧ t2 ∈ r2}. There is a violation
of business domain constraint among relations r1 and r2 iff: check(W, Z) = false.
Example: The attribute Invoice_Total of a tuple of the relation Invoices contains the value 100,
while the sum of the values of attribute Product_Value (for each product of the invoice) of the
relation Invoices_Details is only equal to 90 (instead of 100).
2.5 DQ Problems at the Level of Multiple Data Sources
The DQ problems presented below were identified by analyzing the values of multiple data sources, as
illustrated in Figure 1. As mentioned in Section 1, this paper only addresses the DQ problems related to
the instances (values) of data, i.e., the extensional level [6]. There are other kinds of DQ problems that
occur at the intensional level, i.e., problems related to the structure of data [6], also known as problems
among data schemas. For the reader interested in these problems, we suggest the work of Kashyap and
Sheth [5].
In this section, we assume that the relation schemas R1(A1) and R2(A2) belong to two different data
sources, respectively, DS1 and DS2. Both schemas concern the same real-world entity (e.g., customers).
We also assume that relation schema heterogeneities among DS1 and DS2 are solved, i.e., two attributes
referring to the same real-world property (e.g., unitary price) have the same name. However, the number
of attributes used in each data schema may be different.
Heterogeneity of syntaxes
Definition: Let G(a) be the syntax of attribute a, given by a grammar or a regular expression. There
is heterogeneity of syntaxes among relations r1 and r2 iff: ∃a1 ∈ R1(A1), ∃a2 ∈ R2(A2) : a1 = a2 ∧
type(a1) = type(a2) ∧ G(a1) ≠ G(a2).
Example: The attribute Insertion_Date of relation Customers from DS1 has the syntax dd/mm/yyyy,
while the attribute Insertion_Date of relation Customers from DS2, has the syntax yyyy/mm/dd.
Heterogeneity of measure units
Definition: Let S be the set of attribute names common to both relations, defined as: S = {a | a ∈
R1(A1) ∧ a ∈ R2(A2) ∧ type(a) = numeric}. Let k be a numeric constant value. There is
heterogeneity of measure units in attribute a of relations r1 and r2 iff: ∃a ∈ S, ∃t1 ∈ r1, ∃t2 ∈ r2 :
v(t1,a) = k * v(t2,a) ∧ k > 0 ∧ k ≠ 1.
Example: The attribute Product_Sell_Price is represented in euros in DS1, while in DS2 it is
represented in dollars.
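One way to surface a unit mismatch is to join matching tuples from the two sources and test whether the attribute values differ by a constant factor k ≠ 1. A sketch under the assumptions that tuples can be joined on a shared key and that values are exact:

```python
def unit_heterogeneity(r1, r2, attribute, key, tolerance=1e-9):
    """If matching tuples (joined on `key`) differ in `attribute` by a constant
    factor k != 1, return k (suggesting different measure units), else None."""
    pairs = [(t1[attribute], t2[attribute])
             for t1 in r1 for t2 in r2 if t1[key] == t2[key]]
    if not pairs:
        return None
    k = pairs[0][0] / pairs[0][1]
    constant = all(abs(v1 - k * v2) <= tolerance for v1, v2 in pairs)
    return k if constant and abs(k - 1) > tolerance else None

ds1 = [{"Product": "A", "Product_Sell_Price": 10.0},
       {"Product": "B", "Product_Sell_Price": 20.0}]
ds2 = [{"Product": "A", "Product_Sell_Price": 12.5},   # same prices in dollars
       {"Product": "B", "Product_Sell_Price": 25.0}]
print(unit_heterogeneity(ds1, ds2, "Product_Sell_Price", "Product"))  # 0.8
```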
Heterogeneity of representation
Definition: Let S be the set of attribute names common to both relations, defined as: S = R1(A1) ∩
R2(A2). Let translate be a function that receives an attribute value from a relation, looks up the
corresponding value in the other relation in a dictionary (lookup table), and returns it. There is
heterogeneity of representation in attribute a among relations r1 and r2 iff: ∃a ∈ S, ∃t1 ∈ r1, ∃t2 ∈ r2 :
v(t1,a) ≠ v(t2,a) ∧ translate(v(t1,a)) = v(t2,a).
Example: To represent the attribute Gender, the values F and M are used in DS1, while in DS2 the
values 0 and 1 are used.
Existence of synonyms
Definition: Let S be the set of attribute names common to both relations, defined as: S = {a | a ∈
R1(A1) ∧ a ∈ R2(A2) ∧ type(a) = string}. Let meaning be a function that receives a word, looks up
its meaning in a dictionary, and returns it. There are synonyms in attribute a among relations r1
and r2 iff: ∃a ∈ S, ∃t1 ∈ r1, ∃t2 ∈ r2 : meaning(v(t1,a)) = meaning(v(t2,a)) ∧ v(t1,a) ≠ v(t2,a).
Example: The relation Occupations of DS1 contains a tuple with Professor, while the equivalent
relation in DS2 contains a tuple with Teacher; both represent the same occupation.
Existence of homonyms
Definition: Let S be the set of attribute names common to both relations, defined as: S = {a | a ∈
R1(A1) ∧ a ∈ R2(A2) ∧ type(a) = string}. Let meaning1 and meaning2 be functions that receive a
word, look up its meaning in a dictionary (in the context of DS1 or DS2, respectively), and return it.
There are homonyms in attribute a among relations r1 and r2 iff: ∃a ∈ S, ∃t1 ∈ r1, ∃t2 ∈ r2 :
meaning1(v(t1,a)) ≠ meaning2(v(t2,a)) ∧ v(t1,a) = v(t2,a).
Example: In relation Products of DS1, there exists a product named Mouse (a branch of a company
that sells computer hardware), while in relation Products of DS2, there also exists a product named
Mouse (another branch of the company sells domestic animals, so the products here are the animals
themselves).
Approximate duplicate tuples
Definition: Let S be a set of attribute names, defined as: S = {a | a ∈ R(A) ∧ a does not belong to
the primary key}, i.e., S ⊆ R(A). Let θ be a real number between 0 and 1. Let similarity be a
function that receives two values of an attribute, computes the similarity between them, and returns
it (also as a real number between 0 and 1). There are approximate duplicate tuples among relations
r1 and r2 iff: ∃t1 ∈ r1, ∃t2 ∈ r2, ∀a ∈ S : similarity(v(t1,a),v(t2,a)) ≥ θ.
Example: The tuple Customer(10, ‘Smith Barney’, ‘Flowers Street, 123’, 502899106) in DS1 is
an approximate duplicate of the tuple Customer(27, ‘Smith B.’, ‘Flowers St., 123’, 502899106) in
DS2.
Inconsistent duplicate tuples
Definition: Let S be a set of attribute names, defined as: S = {a | a ∈ R(A) ∧ a does not belong to
the primary key}, i.e., S ⊆ R(A). Let θ be a real number between 0 and 1. Let similarity be a
function that receives two values of an attribute, computes the similarity between them, and returns
it (also as a real number between 0 and 1). There are inconsistent duplicate tuples among relations
r1 and r2 iff: ∃t1 ∈ r1, ∃t2 ∈ r2, ∃a2 ∈ S, ∀a1 ∈ S\{a2} : similarity(v(t1,a1),v(t2,a1)) ≥ θ ∧
similarity(v(t1,a2),v(t2,a2)) < θ.
Example: The tuple Customer(10, ‘Smith Barney’, ‘Flowers Street, 123’, 502899106) in DS1 is an
inconsistent duplicate of the tuple Customer(27, ‘Smith Barney’, ‘Sun Street, 321’, 502899106) in
DS2.
Violation of business domain constraint
Definition: Let check be a function that receives the attribute values of the tuples from relations r1
and r2 (of DS1 and DS2), checks whether a given constraint x is respected, and returns a boolean
value. Let S be the set of attribute names common to both relations, defined as: S = {a | a ∈ R1(A1)
∧ a ∈ R2(A2) ∧ a is used in the formulation of x}. Let T and U be sets that contain the attribute
values of all tuples from relations r1 and r2, defined as: T = {v(t1,S) | t1 ∈ r1} and U = {v(t2,S) | t2 ∈
r2}. There is a violation of business domain constraint among relations r1 and r2 iff: check(T, U) =
false.
Example: The maximum number of product families allowed is 10; the relation Product_Families
in DS1 contains 7 families, and the relation Product_Families in DS2 contains 8 families; the
number of distinct product families resulting from the integration (union) of both sources is 11; this
number violates the constraint.
2.6 Summary
Table 1 presents a summary of our taxonomy of DQ problems. The problems and the corresponding
granularity levels where they occur are shown in the table.
3. RELATED WORK
Kim et al. [7] present a quite complete taxonomy of DQ problems, describing the logic behind its
structure. They adopt a successive hierarchical refinement approach. The taxonomy is based on the
premise that DQ problems manifest in three different ways: missing data; not missing but wrong data; and
not missing and not wrong but unusable. Unusable data occurs when two or more databases are integrated
or representation standards are not consistently used when entering data. The taxonomy is a hierarchical
decomposition of these three basic manifestations of DQ problems. Considering the approach used and
the DQ problems identified, this taxonomy is the closest to ours.
Table 1: DQ problems organized by granularity level

| Data Quality Problem | Attrib. | Column | Row | Single Relation | Multiple Relations | Mult. Data Sources |
|---|---|---|---|---|---|---|
| Missing value | x | | | | | |
| Syntax violation | x | | | | | |
| Incorrect value | x | | | | | |
| Domain violation | x | | | | | |
| Invalid substring | x | | | | | |
| Misspelling error | x | | | | | |
| Imprecise value | x | | | | | |
| Violation of business domain constraint | x | x | x | x | x | x |
| Unique value violation | | x | | | | |
| Existence of synonyms | | x | | | | x |
| Semi-empty tuple | | | x | | | |
| Violation of functional dependency | | | x | | | |
| Approximate duplicate tuples | | | | x | | x |
| Inconsistent duplicate tuples | | | | x | | x |
| Referential integrity violation | | | | | x | |
| Incorrect reference | | | | | x | |
| Heterogeneity of syntaxes | | | | | x | x |
| Circularity among tuples in a self-relationship | | | | | x | |
| Heterogeneity of measure units | | | | | | x |
| Heterogeneity of representation | | | | | | x |
| Existence of homonyms | | | | | | x |

(The Attrib., Column, and Row columns are the three sub-levels of the Attribute/Tuple granularity level.)
Müller and Freytag [9] roughly classify DQ problems into syntactical, semantic, and coverage anomalies. Syntactical anomalies concern the syntax and values used to represent entities (e.g., lexical errors, domain syntax errors). Semantic anomalies hinder the data collection from being a comprehensive and non-redundant representation of the real world (e.g., duplicates, contradictions). Coverage anomalies are related to the number of real-world entities and entity properties actually stored in the data collection (e.g., missing values). This work is limited to DQ problems that occur in a single relation of a single source, so important DQ problems are not covered.
Rahm and Do [14] distinguish between single-source and multi-source problems, as we do. However, at the single-source level they do not divide the problems into those that occur in a single relation and those that result from relationships among multiple relations. Single-source and multi-source problems are divided into schema-related and instance-related problems. Schema-related problems are those that can be addressed by improving schema design, schema translation, and schema integration. Instance-related problems correspond to errors and inconsistencies in the actual data contents that cannot be prevented at the schema level. As mentioned in the introduction, we are only concerned with the DQ problems related to the instances of the data, so we do not make this separation. For single-source problems, both schema-related and instance-related, they distinguish between the following scopes: attribute, record, record type, and source. This is similar to the organization that we present in our taxonomy.
Even though the term used may differ (e.g., enterprise constraint violation; business rule violation), the DQ problem violation of business domain constraint included in our taxonomy is mentioned in almost every book about databases (e.g., [2, 16]). Surprisingly, however, it is not included in any of the taxonomies analysed. The other new DQ problems introduced by our taxonomy are: (i) semi-empty tuple; (ii) heterogeneity of syntaxes (at the levels of multiple relations and multiple data sources); and (iii) circularity among tuples in a self-relationship. All the problems identified in the three taxonomies are also covered by ours, although the names used to label the problems sometimes differ. Finally, in those taxonomies the DQ problems are described only through textual descriptions, while we present a rigorous definition for each problem.
4. CONCLUSION
This paper has presented our taxonomy of DQ problems. The taxonomy results from the research conducted to identify DQ problems at each granularity level of the usual data organization model. The study followed a bottom-up approach, from the lowest granularity level (attribute/tuple) to the highest (multiple data sources) at which DQ problems may appear, and the taxonomy was presented in the paper in the same order. Six groups of related DQ problems were derived from the four granularity levels. Since the approach followed to identify the problems was exhaustive and systematic, we are confident that no problem is missing.
The DQ problems included in our taxonomy were specified through rigorous definitions. This feature distinguishes our taxonomy from the related ones, which only use text to describe the problems. We believe that giving DQ problems a formal framework is a valuable contribution, since: (a) it is the only way to ensure a clear and precise definition of each DQ problem; and (b) it specifies what is required to detect the problem automatically, i.e.: (i) the metadata knowledge needed; (ii) the mathematical expression that defines the DQ problem, which can be seen as a logical procedure (rule) to detect it; and (iii) eventually, the function required to perform some transformation. These elements are explicitly included in the definition of each DQ problem.
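The three elements of each formal definition map naturally onto a detection rule. The sketch below shows one possible shape for such a rule, applied to the missing value problem; the class, field names, and the `phone` attribute are illustrative assumptions, not the paper's notation.

```python
# Sketch of how a formal DQ definition maps to an executable rule:
# (i) the metadata knowledge needed, (ii) a predicate that flags the
# problem, and (iii) an optional transformation. Names are illustrative.

from dataclasses import dataclass
from typing import Any, Callable, Optional

@dataclass
class DQProblem:
    name: str
    metadata: dict                                   # (i) metadata needed
    detect: Callable[[Any, dict], bool]              # (ii) rule: True = problem present
    transform: Optional[Callable[[Any, dict], Any]] = None  # (iii) optional fix

# Example instance: "missing value" on a mandatory attribute.
missing_value = DQProblem(
    name="missing value",
    metadata={"attribute": "phone", "mandatory": True},
    detect=lambda value, md: md["mandatory"] and value in (None, ""),
)

print(missing_value.detect(None, missing_value.metadata))   # True: problem detected
print(missing_value.detect("+351", missing_value.metadata)) # False
```

A detection tool could then iterate over a catalogue of such rule objects, one per problem in the taxonomy, evaluating each predicate against the data and its metadata.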
This work is a first step towards the development of a tool that automatically detects DQ problems. We now know and understand the entire set of DQ problems that may affect data, what is needed to detect each one, and how that can be translated into a computational method. All these elements need to be organized into the DQ tool architecture, which is our next task; afterwards, we intend to start developing the tool. We believe it will complement the limited detection capabilities currently offered by commercial data profiling tools.
ACKNOWLEDGMENTS
We would like to thank Helena Galhardas for the fruitful discussions and useful comments that helped us
to improve the contents of the paper.
REFERENCES
[1] Atzeni, P. and Antonellis, V. – Relational Database Theory. The Benjamin/Cummings Publishing Company,
Inc., 1983.
[2] Connolly, T. and Begg, C. – Database Systems: A Practical Approach to Design, Implementation and
Management. Addison Wesley Longman Limited, 1999. ISBN 0-201-34287-1.
[3] Dasu, T.; Vesonder, G. T. and Wright, J. R. – “Data Quality through Knowledge Engineering”. In Proceedings
of the SIGKDD'03 Conference, Washington. August 2003. pp. 705-710.
[4] Galhardas, H.; Florescu, D.; Shasha, D.; Simon, E. and Saita, C.-A. – “Data Cleaning: Language, Model, and
Algorithms”. In Proceedings of the Very Large Databases Conference (VLDB). 2001.
[5] Kashyap, V. and Sheth, A. – “Schematic and Semantic Similarities Between Database Objects: a Context-
Based Approach”. Very Large Databases Journal, 5 (4). 1996. pp. 276–304.
[6] Kedad, Z. and Métais, E. – “Ontology-Based Data Cleaning”. Lecture Notes in Computer Science, 2553, 2002. pp. 137-149.
[7] Kim, W.; Choi, B.-J.; Hong, E.-K.; Kim, S.-K. and Lee, D. – “A Taxonomy of Dirty Data”. Data Mining and
Knowledge Discovery, 7. 2003. pp. 81-99.
[8] Meta Group – Data Warehouse Scorecard. Meta Group, 1999.
[9] Müller, H. and Freytag, J.-C. – “Problems, Methods, and Challenges in Comprehensive Data Cleansing”.
Technical Report HUB-IB-164, Humboldt University, Berlin, 2003.
[10] Oliveira, P.; Rodrigues, F. and Henriques, P. – “Limpeza de Dados: Uma Visão Geral”. In Belo, O.; Lourenço,
A. and Alves, R. (Eds.) – Proceedings of Data Gadgets 2004 Workshop – Bringing Up Emerging Solutions for
Data Warehousing Systems (in conjunction with JISBD’04), Málaga, Spain, November 2004. pp. 39-51 (in
Portuguese).
[11] Oliveira, P.; Rodrigues, F.; Henriques, P. and Galhardas, H. – “A Taxonomy of Data Quality Problems”. In
Proceedings of the 2nd International Workshop on Data and Information Quality (in conjunction with
CAiSE’05), Porto, Portugal, June 2005.
[12] Orr, K. – “Data Quality and Systems Theory”. Communications of the ACM, 41 (2). 1998. pp. 66-71.
[13] Pipino, L.; Lee, Y. and Wang, R. – “Data Quality Assessment”. Communications of the ACM, 45 (4). 2002. pp.
211-218.
[14] Rahm, E. and Do, H. H. – “Data Cleaning: Problems and Current Approaches”. IEEE Bulletin of the Technical
Committee on Data Engineering, 24 (4). 2000.
[15] Sattler, K. and Schallehn, E. – “A Data Preparation Framework based on a Multidatabase Language”. In
Proceedings of International Database Engineering and Applications Symposium (IDEAS 2001), Grenoble,
France. IEEE Computer Society. 2001. pp. 219-228.
[16] Ullman, J. and Widom, J. – A First Course in Database Systems. Prentice-Hall, Inc., 1997. ISBN 0-13-861337-0.
[17] Wand, Y. and Wang, R. – “Anchoring Data Quality Dimensions in Ontological Foundations”. Communications
of the ACM, 39 (11). 1996. pp. 86-95.