Article

Incorporating cardinality constraints and synonym rules into conditional functional dependencies

June 2009
Information Processing Letters 109(14):783-789

June 2009
109(14):783-789

DOI:10.1016/j.ipl.2009.03.021

Source
DBLP

Authors:

Wenguang Chen

Peking University

Wenfei Fan

The University of Edinburgh

Shuai Ma

Beihang University (BUAA)

Abstract We propose an extension of conditional functional dependencies (CFDs), denoted by CFD,s remain NP-complete and coNP-complete, respectively, the same as their counterparts for CFDs. We also identify tractable special cases. Key words: computational complexity, databases, specification languages

Analyses and Validation of Conditional Dependencies with Built-in Predicates

Conference Paper

Full-text available

Aug 2009

This paper proposes a natural extension of conditional func- tional dependencies (cfds (14)) and conditional inclusion dependencies (cinds (8)), denoted by cfdps and cindps, respectively, by specifying pat- terns of data values with 6=; and ‚ predicates. As data quality rules, cfdps and cindps are able to capture errors that commonly arise in practice but cannot be detected by cfds and cinds. We establish two sets of results for central technical problems associated with cfdps and cindps. (a) One concerns the satisflability and implication problems for cfdps and cindps, taken separately or together. These are important for, e.g., deciding whether data quality rules are dirty themselves, and for removing redundant rules. We show that despite the increased ex- pressive power, the static analyses of cfdps and cindps retain the same complexity as their cfds and cinds counterparts. (b) The other concerns validation of cfdps and cindps. We show that given a set § of cfdps and cindps on a database D, a set of sql queries can be automatically generated that, when evaluated against D, return all tuples in D that violate some dependencies in §. This provides commercial dbms with an immediate capability to detect errors based on cfdps and cindps.

Adaptive Data Quality Monitoring with a Focus on the Completeness of Timestamped Data

Thesis

Jan 2018

Gregor Endler

Data quality is an important consideration in many areas. Today's ubiquity of data requires this consideration to span arbitrarily different domains. In both research and business, from medicine to space travel, data quality may be the deciding factor between success and failure. Especially assessing the completeness of data has been identified as an important open problem that cannot yet be solved in many cases where it has impact. Worse still, data quality is context dependent. Thus, the only way that quality of data can be defined is to consult domain experts. To develop solutions that automatically measure data quality, however, professional programmers are required. The diversity of domains where data quality has impact means the two groups rarely intersect. Hence, a solution is necessary that allows domain experts to implement automatic data quality measurement on their own. This thesis presents two approaches to these problems. The first, AdDaQuaM, is a framework for continuous adaptive data quality monitoring. It supports domain experts by allowing them to state data quality requirements in an easy-to-use manner without the need for programming skills. Because some necessary functions are too complex to be integrated in this way, however, the framework also allows arbitrarily complex modules. It integrates all its capabilities into one unified view of a monitored data storage system. AdDaQuaM's evaluation shows that it has the necessary capabilities to enable continuous monitoring of data quality and that it is able to smoothly evolve with changing requirements. The second approach, ForCE, deals in-depth with the problem of data population completeness. In particular, it classifies the completeness of groups of timestamped data. A formal problem statement is provided along with the approach. ForCE's performance is evaluated using test data from the two domains of medical accounting and animal tracking. Lacking a similar system for direct comparison, three baselines that might be used in practice are stated instead. ForCE surpasses them in the majority of cases.

Relaxed Functional Dependencies - A Survey of Approaches

Article

Jan 2016

Recently, there has been a renovated interest in functional dependencies due to the possibility of employing them in several advanced database operations, such as data cleaning, query relaxation, record matching, and so forth. In particular, the constraints defined for canonical functional dependencies have been relaxed to capture inconsistencies in real data, patterns of semantically related data, or semantic relationships in complex data types. In this paper, we have surveyed 35 of such functional dependencies, providing a classification criteria, motivating examples, and a systematic analysis of them.

Extending inclusion dependencies with conditions

Article

Jan 2014
THEOR COMPUT SCI

This paper introduces a class of conditional inclusion dependencies (CINDs), which extends inclusion dependencies (INDs) by enforcing patterns of semantically related data values. We show that CINDs are useful not only in data cleaning, but also in contextual schema matching. We give a full treatment of the static analysis of CINDs, and show that CINDs retain most desired properties of traditional INDs: (a) CINDs are always satisfiable; (b) CINDs are finitely axiomatizable, i.e., there exists a sound and complete inference system for the implication analysis of CINDs; and (c) the implication problem for CINDs has the same complexity as its traditional counterpart, namely, PSPACE-complete, in the absence of attributes with a finite domain; but it is EXPTIME-complete in the general setting. In addition, we investigate the interaction between CINDs and conditional functional dependencies (CFDs), as well as two practical fragments of CINDs, namely acyclic CINDs and unary CINDs. We show the following: (d) the satisfiability problem for the combination of CINDs and CFDs becomes undecidable, even in the absence of finite-domain attributes; (e) in the absence of finite-domain attributes, the implication problem for acyclic CINDs and for unary CINDs retains the same complexity as its traditional counterpart, namely, NP-complete and PTIME, respectively; but in the general setting, it becomes PSPACE-complete and coNP-complete, respectively; and (f) the implication problem for acyclic unary CINDs remains in PTIME in the absence of finite-domain attributes and coNP-complete in the general setting.

Advancing Data Curation With Metadata and Statistical Relational Learning

Book

Full-text available

Mar 2020

Larysa Visengeriyeva

The foundation of every data science project depends on clean data because the quality of the data determines the quality of the insights derived from data by using machine learning or analytics. In this dissertation, we tackle the problem of data cleaning and provide three approaches to advance data error detection and repair: (1) We establish a mapping that reflects the connection between data quality issues and extractable dataset’s metadata, and propose this mapping as a guideline for rapid prototyping of an error detection strategy; (2) We introduce two holistic approaches for effectively combining different error detection strategies to increase the efficacy of error detection. Our methods are based on state-of-the-art ensemble learning algorithms and incorporate the metadata of the dataset; and (3) We propose an approach for addressing data quality issues by formulating a set of data cleaning rules without the manual specification of the rules execution order. The concepts of statistical relational learning and probabilistic inference provide the foundation for our method. We use the Markov logic formalism because it declaratively models data quality rules as first-order logic sentences. Markov logic allows the usage of probabilistic joint inference over data cleaning rules to detect data errors and suggest a repair.

Improving Data Quality by Leveraging Statistical Relational Learning

Conference Paper

Full-text available

Jun 2016

Digitally collected data suffers from many data quality issues, such as duplicate, incorrect, or incomplete data. A common approach for counteracting these issues is to formulate a set of data cleaning rules to identify and repair incorrect, duplicate and missing data. Data cleaning systems must be able to treat data quality rules holistically, to incorporate heterogeneous constraints within a single routine, and to automate data curation. We propose an approach to data cleaning based on statistical relational learning (SRL). We argue that a formalism-Markov logic-is a natural fit for modeling data quality rules. Our approach allows for the usage of probabilistic joint inference over interleaved data cleaning rules to improve data quality. Furthermore, it obliterates the need to specify the order of rule execution. We describe how data quality rules expressed as formulas in first-order logic directly translate into the predictive model in our SRL framework.

An Efficient Way for Clustering Using Alternative Decision Tree

Article

Jan 2012
Am J Appl Sci

Problem statement: To Improve the quality of clustering; a Multi-Level Clustering (MLC) algorithm which produces a most accurate cluster with most closely related object using Alternative Decision Tree (ADT) technique is proposed. Approach: Our proposed method combines treeprojection and condition for clustering formation and also is capable to produce a customizable cluster for varying kind of data along with varying number of cluster. Results: The experimental results shows that the proposed system has lower computational complexity, reduce time consumption; most optimize way for cluster formulation and clustering quality compared is compared effectively. Conclusion: The new method offers more accuracy of cluster data without manual intervention at the time of cluster formation. Compared to existing clustering algorithms either partition or hierarchical, our new method is more robust and easy to reach the solution of real world complex business problem.

Logical Schema Design that Quantifies Update Inefficiency and Join Efficiency

Conference Paper

Jun 2021

Probabilistic Cardinality Constraints: Validation, Reasoning, and Semantic Summaries

Article

Jul 2018

Probabilistic databases address the requirements of applications that produce large collections of uncertain data. They should provide declarative means to control the integrity of data. Cardinality constraints, in particular, control the occurrences of data patterns by declaring in how many records a combination of data values can occur. We propose cardinality constraints on probabilistic data, which stipulate lower bounds on the marginal probability by which a cardinality constraint holds. We investigate limits and opportunities for automating their use in integrity control. This includes hardness results for their validation, axiomatic and efficient algorithmic characterisations of their implication problem, and an algorithm that computes succinct semantic summaries for any collection of these constraints. Experiments complement our theoretical analysis on the time and space complexity of computing semantic summaries, suggesting that their computation provides the basis to acquire meaningful constraints. We also establish evidence that probabilistic functional and inclusion dependencies cannot be managed as simply as probabilistic cardinality constraints.

Discovering context-aware conditional functional dependencies

Article

Dec 2016

Conditional functional dependencies(CFDs) are important techniques for data consistency. However, CFDs are limited to 1) provide the reasonable values for consistency repairing and 2) detect potential errors. This paper presents context-aware conditional functional dependencies(CCFDs) which contribute to provide reasonable values and detect potential errors. Especially, we focus on automatically discovering minimal CCFDs. In this paper, we present context relativity to measure the relationship of CFDs. The overlap of the related CFDs can provide reasonable values which result in more accuracy consistency repairing, and some related CFDs are combined into CCFDs.Moreover,we prove that discovering minimal CCFDs is NP-complete and we design the precise method and the heuristic method. We also present the dominating value to facilitate the process in both the precise method and the heuristic method. Additionally, the context relativity of the CFDs affects the cleaning results. We will give an approximate threshold of context relativity according to data distribution for suggestion. The repairing results are approvedmore accuracy, even evidenced by our empirical evaluation.

Generic and Declarative Approaches to Data Quality Management

Chapter

Feb 2013

Data quality assessment and data cleaning tasks have traditionally been addressed through procedural solutions. Most of the time, those solutions have been applicable to specific problems and domains. In the last few years we have seen the emergence of more generic solutions, and also of declarative and rule-based specifications of the intended solutions of data cleaning processes. In this chapter we review some of those recent developments.

Extending Conditional Dependencies with Built-in Predicates

Article

Dec 2015

This paper proposes a natural extension of conditional functional dependencies (CFDs [1]) and conditional inclusion dependencies (CINDs [2]), denoted by CFDps and CIND(p)s, respectively, by specifying patterns of data values with not equal, <, <=, >, and >= predicates. As data quality rules, CFDps and CIND(p)s are able to capture errors that commonly arise in practice but cannot be detected by CFDs and CINDs. We establish two sets of results for central technical problems associated with CFD(p)s and CIND(p)s. (a) One concerns the satisfiability and implication problems for CFD(p)s and CIND(p)s, taken separately or together. These are important for, e.g. deciding whether data quality rules are dirty themselves, and for removing redundant rules. We show that despite the increased expressive power, the static analyses of CFD(p)s and CIND(p)s retain the same complexity as their CFDs and CINDs counterparts. (b) The other concerns validation of CFD(p)s and CIND(p)s. We show that given a set Sigma of CFD(p)s and CIND(p)s on a database D, a set of SQL queries can be automatically generated that, when evaluated against D, return all tuples in D that violate some dependencies in Sigma. We also experimentally verified the efficiency and effectiveness of our SQL based error detection techniques, using real-life data. This provides commercial DBMS with an immediate capability to detect errors based on CFD(p)s and CIND(p)s.

Foundations of Data Quality Management

Book

Jul 2012

Efficient derivation of numerical dependencies

Article

May 2013
INFORM SYST

Numerical dependencies (NDs) are database constraints that limit the number of distinct Y-values that can appear together with any X-value, where both X and Y are sets of attributes in a relation schema. While it is known that NDs are not finitely axiomatizable, there is no study on how to efficiently derive NDs using a set of sound (yet necessarily incomplete) rules. In this paper, after proving that solving the entailment problem for NDs using the chase procedure has exponential space complexity, we show that, given a set of inference rules similar to those used for functional dependencies, the membership problem for NDs is NP-hard. We then provide a graph-based characterization of NDs, which is exploited to design an efficient branch & bound algorithm for ND derivation. Our algorithm adopts several optimization strategies that provide considerable speed-up over a naïve approach, as confirmed by the results of extensive tests we made for efficiency and effectiveness using six different datasets.

Foundations of Databases

Chapter

Full-text available

Jan 1995

The complexity and approximation of fixing numerical attributes in databases under integrity constraints

Article

Full-text available

Apr 2005
INFORM SYST

Consistent query answering is the problem of characterizing and computing the semantically correct answers to queries from a database that may not satisfy certain integrity constraints. Consistent answers are characterized as those answers that are invariant under all minimally repaired versions of the original database. We study the problem of repairing databases with respect to denial constraints by fixing integer numerical values taken by attributes. We introduce a quantitative definition of database repair, and investigate the complexity of several decision and optimization problems. Among them, Database Repair Problem (DRP): deciding the existence of repairs within a given distance to the original instance, and CQA: deciding consistency of answers to simple and aggregate conjunctive queries under different semantics. We provide sharp complexity bounds, identifying relevant tractable and intractable cases. We also develop approximation algorithms for the latter. Among other results, we establish: (a) The -hardness of CQA. (b) That DRP is MAXSNP-hard, but has a good approximation. (c) The intractability of CQA for aggregate queries for one database atom denials (plus built-ins), and also that it has a good approximation.

Consistent Query Answers in Inconsistent Databases

Conference Paper

Full-text available

May 1999

In this paper we consider the problem of the logical characterization of the notion of consistent answer in a relational database that may violate given integrity constraints. This notion is captured in terms of the possible repaired versions of the database. A method for computing consistent answers is given and its soundness and completeness (for some classes of constraints and queries) proved. The method is based on an iterative procedure whose termination for several classes of constraints is proved as well.

Improving Data Quality: Consistency and Accuracy.

Conference Paper

Full-text available

Jan 2007

Two central criteria for data quality are consistency and accuracy. Inconsistencies and errors in a database often emerge as violations of integrity constraints. Given a dirty database D, one needs automated methods to make it consistent, i.e., find a repair D' that satisfies the constraints and "minimally" differs from D. Equally important is to ensure that the automatically-generated repair D' is accurate, or makes sense, i.e., D' differs from the "correct" data within a predefined bound. This paper studies effective methods for improving both data consistency and accuracy. We employ a class of conditional functional dependencies (CFDs) proposed in [6] to specify the consistency of the data, which are able to capture inconsistencies and errors beyond what their traditional counterparts can catch. To improve the consistency of the data, we propose two algorithms: one for automatically computing a repair D' that satisfies a given set of CFDs, and the other for incrementally finding a repair in response to updates to a clean database. We show that both problems are intractable. Although our algorithms are necessarily heuristic, we experimentally verify that the methods are effective and efficient. Moreover, we develop a statistical method that guarantees that the repairs found by the algorithms are accurate above a predefined rate without incurring excessive user interaction.

Discovering data quality rules

Article

Full-text available

Aug 2008

Dirty data is a serious problem for businesses leading to incorrect decision making, inefficient daily operations, and ultimately wasting both time and money. Dirty data often arises when domain constraints and business rules, meant to preserve data consistency and accuracy, are enforced incompletely or not at all in application code. In this work, we propose a new data-driven tool that can be used within an organization's data quality management process to suggest possible rules, and to identify conformant and non-conformant records. Data quality rules are known to be contextual, so we focus on the discovery of context-dependent rules. Specifically, we search for conditional functional dependencies (CFDs), that is, functional dependencies that hold only over a portion of the data. The output of our tool is a set of functional dependencies together with the context in which they hold (for example, a rule that states for CS graduate courses, the course number and term functionally determines the room and instructor). Since the input to our tool will likely be a dirty database, we also search for CFDs that almost hold. We return these rules together with the non-conformant records (as these are potentially dirty records). We present effective algorithms for discovering CFDs and dirty values in a data instance. Our discovery algorithm searches for minimal CFDs among the data values and prunes redundant candidates. No universal objective measures of data quality or data quality rules are known. Hence, to avoid returning an unnecessarily large number of CFDs and only those that are most interesting, we evaluate a set of interest metrics and present comparative results using real datasets. We also present an experimental study showing the scalability of our techniques.

On generating near-optimal tableaux for conditional functional dependencies

Article

Full-text available

Aug 2008

Conditional functional dependencies (CFDs) have recently been proposed as a useful integrity constraint to summarize data semantics and identify data inconsistencies. A CFD augments a functional dependency (FD) with a pattern tableau that defines the context (i.e., the subset of tuples) in which the underlying FD holds. While many aspects of CFDs have been studied, including static analysis and detecting and repairing violations, there has not been prior work on generating pattern tableaux, which is critical to realize the full potential of CFDs. This paper is the first to formally characterize a "good" pattern tableau, based on naturally desirable properties of support, confidence and parsimony. We show that the problem of generating an optimal tableau for a given FD is NP-complete but can be approximated in polynomial time via a greedy algorithm. For large data sets, we propose an "on-demand" algorithm providing the same approximation bound, that outperforms the basic greedy algorithm in running time by an order of magnitude. For ordered attributes, we propose the range tableau as a generalization of a pattern tableau, which can achieve even more parsimony. The effectiveness and efficiency of our techniques are experimentally demonstrated on real data.

Conditional functional dependencies for capturing data inconsistencies

Article

Full-text available

Jun 2008

We propose a class of integrity constraints for relational databases, referred to as conditional functional dependencies (CFDs), and study their applications in data cleaning. In contrast to traditional functional dependencies (FDs) that were developed mainly for schema design, CFDs aim at capturing the consistency of data by enforcing bindings of semantically related values. For static analysis of CFDs we investigate the consistency problem, which is to determine whether or not there exists a nonempty database satisfying a given set of CFDs, and the implication problem, which is to decide whether or not a set of CFDs entails another CFD. We show that while any set of transitional FDs is trivially consistent, the consistency problem is NP-complete for CFDs, but it is in PTIME when either the database schema is predefined or no attributes involved in the CFDs have a finite domain. For the implication analysis of CFDs, we provide an inference system analogous to Armstrong's axioms for FDs, and show that the implication problem is coNP-complete for CFDs in contrast to the linear-time complexity for their traditional counterpart. We also present an algorithm for computing a minimal cover of a set of CFDs. Since CFDs allow data bindings, in some cases CFDs may be physically large, complicating the detection of constraint violations. We develop techniques for detecting CFD violations in SQL as well as novel techniques for checking multiple constraints by a single query. We also provide incremental methods for checking CFDs in response to changes to the database. We experimentally verify the effectiveness of our CFD-based methods for inconsistency detection. This work not only yields a constraint theory for CFDs but is also a step toward a practical constraint-based method for improving data quality.

Increasing the Expressivity of Conditional Functional Dependencies without Extra Complexity

Conference Paper

Full-text available

May 2008

The paper proposes an extension of CFDs [1], referred to as extended Conditional Functional Dependencies (eCFDs). In contrast to CFDs, eCFDs specify patterns of semantically related values in terms of disjunction and inequality, and are capable of catching inconsistencies that arise in practice but cannot be detected by CFDs. The increase in expressive power does not incur extra complexity: we show that the satisfiability and implication analyses of eCFDs remain NP - complete and coNP -complete, respectively, the same as their CFDs counterparts. In light of the intractability, we present an algorithm that approximates the maximum number of eCFDs that are satisfiable. In addition, we revise SQL techniques for detecting CFD violations, and show that violations of multiple eCFDs can be captured via a single pair of SQL queries. We also introduce an incremental SQL technique for detecting eCFD violations in response to database updates. We experimentally verify the effectiveness and efficiency of our SQL -based detection methods.

Transformation-based Framework for Record Matching

Conference Paper

Full-text available

May 2008

Today's record matching infrastructure does not allow a flexible way to account for synonyms such as "Robert" and "Bob" which refer to the same name, and more general forms of string transformations such as abbreviations. We propose a programmatic framework of record matching that takes such user-defined string transformations as input. To the best of our knowledge, this is the first proposal for such a framework. This transformational framework, while expressive, poses significant computational challenges which we address. We empirically evaluate our techniques over real data.

On the interaction between ISA and cardinality constraints

Conference Paper

Full-text available

Mar 1994

ISA and cardinality constraints are among the most interesting types of constraints in data models. ISA constraints are used to establish several forms of containment among classes, and are receiving great attention in moving to object-oriented data models, where classes are organized in hierarchies based on a generalization/specialization principle. Cardinality constraints impose restrictions on the number of links of a certain type involving every instance of a given class, and can be used for representing several forms of dependencies between classes, including functional and existence dependencies. While the formal properties of each type of constraints are now well understood, little is known of their interaction. We present an effective method for reasoning about a set of ISA and cardinality constraints in the context of a simple data model based on the notions of classes and relationships. In particular, the method allows one both to verify the satisfiability of a schema and to check whether a schema implies a given constraint of any of the two kinds. We prove that the method is sound and complete, thus showing that the reasoning problem for ISA and cardinality constraints is decidable

Constrained Dependencies.

Article

Feb 1997
THEOR COMPUT SCI

Michael J. Maher

We extend the notions of functional and finiteness dependencies to apply to subsets of a relation that are specified by constraints. These dependencies have many applications. We are able to characterize those constraint domains which admit a polynomial time solution of the implication problem (assuming P≠NP) and give an efficient algorithm for these cases, modulo the cost of constraint manipulation. For other cases we offer approximate algorithms. Finally, we outline some applications of these dependencies to the analysis and optimization of CLP programs and database queries.

Computers And Intractability: A Guide to the Theory of NP-Completeness

Chapter

Jan 1979

Constrained Dependencies

Chapter

Jan 2006
THEOR COMPUT SCI

Michael J. Maher

Constraint-Generating Dependencies

Article

Aug 1999
J COMPUT SYST SCI

Traditionally, dependency theory has been developed for uninterpreted data. Specifically, the only assumption that is made about the data domains is that data values can be compared for equality. However, data is often interpreted and there can be advantages in considering data as such, for instance, obtaining more compact representations as is done in constraint databases. This paper considers dependency theory in the context of interpreted data. Specifically, it studies constraint-generating dependencies. These are a generalization of equality-generating dependencies where equality requirements are replaced by constraints on an interpreted domain. The main technical results in the paper are a general decision procedure for the implication and consistency problems for constraint-generating dependencies and complexity results for specific classes of such dependencies over given domains. The decision procedure proceeds by reducing the dependency problem to a decision problem for the constraint theory of interest and is applicable as soon as the underlying constraint theory is decidable. The complexity results are, in some cases, directly lifted from the constraint theory; in other cases, optimal complexity bounds are obtained by taking into account the specific form of the constraint decision problem obtained by reducing the dependency implication problem.

Scalar aggregation in inconsistent databases

Article

Aug 2002
THEOR COMPUT SCI

We consider here scalar aggregation queries in databases that may violate a given set of functional dependencies. We define consistent answers to such queries to be greatest-lowest/least-upper bounds on the value of the scalar function across all (minimal) repairs of the database. We show how to compute such answers. We provide a complete characterization of the computational complexity of this problem. We also show how tractability can be improved in several special cases (one involves a novel application of Boyce–Codd Normal Form) and present a practical hybrid query evaluation method.

Consistent Query Answering: Five Easy Pieces

Conference Paper

Jan 2007

Jan Chomicki

Consistent query answering (CQA) is an approach to query- ing inconsistent databases without repairing them first. This invited talk introduces the basics of CQA, and discusses selected issues in this area. The talk concludes with a summary of other relevant work and an outline of potential future research topics.

Conditional Dependencies for Horizontal Decompositions.

Conference Paper

Jul 1983

A new decomposition theory for functional dependencies in the Relational Database Model is given. It uses a method to break up a relation into two subrelations whose union is the given relation. This horizontal decomposition is based on a new constraint: the conditional-functional dependency. It indicates how to decompose a relation into two restrictions of this relation. The only difference between the two subrelations is a functional dependency that holds in one subrelation but not in the other. Functional dependencies can be expressed as special conditional-functional dependencies. The membership problem is solved for this new constraint, and also for another constraint, induced by the horizontal decomposition: the afunctional dependency. An algorithm is described that performs the decomposition. It uses a new normal form: the Conditional Normal Form. The link between the horizontal- and the traditional vertical decomposition is explained.

Computers and Intracdtability: A Guide to the Theory of NP-Completeness

Book

Jan 1979

Normalization and Axiomatization for Numerical Dependencies

Article

Apr 1985
Inform Contr

We show how to use both horizontal and vertical decomposition to normalize a database schema which contains numerical dependencies. We present a finite set of inference rules for numerical dependencies which is a generalization of the Armstrong axioms. We prove that this set is sound and complete for some special cases.

On the Computational Complexity of Cardinality Constraints in Relational Databases.

Article

Oct 1980
INFORM PROCESS LETT

Paris C. Kanellakis

Eliminating Fuzzy Duplicates in Data Warehouses

Article

Aug 2002

The duplicate elimination problem of detecting multiple tuples, which describe the same real world entity, is an important data cleaning problem. Previous domain independent solutions to this problem relied on standard textual similarity functions (e.g., edit distance, cosine metric) between multi-attribute tuples. However, such approaches result in large numbers of false positives if we want to identify domain-specific abbreviations and conventions. In this paper, we develop an algorithm for eliminating duplicates in dimensional tables in a data warehouse, which are usually associated with hierarchies. We exploit hierarchies to develop a high quality, scalable duplicate elimination algorithm, and evaluate it on real datasets from an operational data warehouse.

Incorporating cardinality constraints and synonym rules into conditional functional dependencies

Abstract

No full-text available

Recommended publications

Students misconceptions in analysis of algorithmic and computational complexity of problems