Conference PaperPDF Available

Renormalization of NoSQL Database Schemas: 37th International Conference, ER 2018, Xi'an, China, October 22–25, 2018, Proceedings

Authors:

Abstract

NoSQL applications often use denormalized databases in order to meet performance goals, but this introduces complications as the database itself has no understanding of application-level denormalization. In this paper, we describe a procedure for reconstructing a normalized conceptual model from a denormalized NoSQL database. The procedure’s input includes functional and inclusion dependencies, which may be mined from the NoSQL database. Exposing a conceptual model provides application developers with information that can be used to guide application and database evolution.
Renormalization of NoSQL Database Schemas
Michael J. Mior and Kenneth Salem
Cheriton School of Computer Science, University of Waterloo
Abstract. NoSQL applications often use denormalized databases in order to
meet performance goals, but this introduces complications as the database it-
self has no understanding of application-level denormalization. In this paper, we
describe a procedure for reconstructing a normalized conceptual model from a
denormalized NoSQL database. The procedure’s input includes functional and
inclusion dependencies, which may be mined from the NoSQL database. Expos-
ing a conceptual model provides application developers with information that can
be used to guide application and database evolution.
Keywords: Renormalization, NoSQL, Database design, Conceptual modeling
1 Introduction
NoSQL databases, such as Apache Cassandra, Apache HBase, and MongoDB, have
grown in popularity, despite their limitations. This is due in part to their performance
and scalability, and in part because they adopt a flexible approach to database schemas.
Because these systems do not provide high-level query languages, applications must de-
normalize and duplicate data across physical structures in the database to answer com-
plex queries. Unfortunately, the NoSQL database itself typically has no understanding
of this denormalization. Thus, it is necessary for applications to operate directly on
physical structures, coupling applications to a particular physical design.
Although NoSQL systems may not require applications to define rigid schemas, ap-
plication developers must still decide how to store information in the database. These
choices can have a significant impact on application performance as well as the read-
ability of application code [6]. For example, consider an application using HBase to
track requests made to an on-line service. The same requests may be stored in multiple
tables since the structures available determine which queries can be asked. The choice
of data representation depends on how the application expects to use the table, i.e., what
kinds of queries and updates it needs to perform. Since the NoSQL system itself is un-
aware of these application decisions, it can provide little to no help in understanding
what is being represented in the database.
Together, the lack of physical data independence and the need for workload-tuned,
denormalized database designs creates challenges for managing and understanding phys-
ical schemas, especially as applications evolve. In our on-line service example, request
information might be stored twice, once grouped and keyed by the customer that sub-
mitted the request, and a second time keyed by the request subject or the request time. If
the application updates a request, or changes the information it tracks for each request,
these changes should be reflected in both locations. Unless the application developer
The final authenticated version is available online at https://doi.org/10.1007/978-3-030-00847-5_34
Project ProjID
ProjName
WorkingOn
Employee
EmpID
EmpName
Department
DeptID
DeptName
WorksIn
Manages
Fig. 1: Schema example after renormalization
maintains external documentation, the only knowledge of this denormalization is em-
bedded within the source code. We aim to surface this knowledge by generating a useful
conceptual model of the data.
We refer to this surfacing task as schema renormalization. This work addresses the
schema renormalization problem through the following technical contributions:
We present a semi-automatic technique for extracting a normalized conceptual
model from an existing denormalized NoSQL database. It produces a normalized
conceptual model for the database, such as the one shown in Figure 1.
We develop an normalization algorithm in Section 5 which forms the core of our
approach. This algorithm uses data dependencies to extract a conceptual model
from the NoSQL system’s physical structures. Our algorithm ensures that the re-
sulting model is free from redundancy implied by these dependencies. To the best
of our knowledge, this is the first normalization algorithm to produce a schema in
interaction-free inclusion dependency normal form [9].
Finally, Section 6, presents a case study which shows the full schema renormaliza-
tion process in action for a NoSQL application. We use this case study to highlight
both the advantages and the limitations of our approach to renormalization.
The conceptual data model that our algorithm produces can serve as a simple refer-
ence, or specification, of the information that has been denormalized across the workload-
tuned physical database structures. We view this model as a key component in a broader
methodology for schema management for NoSQL applications.
2 Renormalization Overview
We renormalize NoSQL databases using a three step process. The first step is to pro-
duce a generic physical schema describing the physical structures that are present in the
NoSQL database. We describe the generic physical model in more detail in Section 3,
and illustrate how it can be produced for different types of NoSQL systems. The second
step is to identify dependencies among the attributes of the generic model. The depen-
dencies, which we discuss in Section 4, can be provided by a user with understanding of
the NoSQL system’s application domain or automatically using existing mining tech-
niques. We provide a brief overview of these steps in the following sections. More detail
is available in an extended technical report [14].
The final step in the renormalization process is to normalize the generic physical
schema using the dependencies, resulting in a logical schema such as the one repre-
sented (as an ER diagram) in Figure 1. This step is automated, using the procedure
described in Section 5. Our algorithm ensures that redundancy in the physical schema
captured by the provided dependencies is removed.
3 The Generic Physical Schema
The first step in the renormalization process is to describe the NoSQL database using
a generic schema. The schemas we use are relational. Specifically, a generic physical
schema consists of a set of relation schemas. Each relation schema describes a physical
structure in the underlying NoSQL database (e.g., a document collection). A relation
schema consists of a unique relation name plus a set of attribute names. Attribute names
are unique within each relation schema.
If the NoSQL database includes a well-defined schema, then describing the physical
schema required for renormalization is a trivial task. The generic schema simply iden-
tifies the attributes that are present in the table, and gives names to both the attributes
and the table itself. For example, Cassandra stores table definitions which can directly
provide the generic schema.
In general, we anticipate that the definition of a generic physical schema for an
application will require user involvement. However, there are tools that may assist with
this process. For example, Izquierdo et al. [7] have proposed a method for extracting a
schema from JSON records in a document store, which could be applied to extract the
generic physical schema required for renormalization.
4 Dependency Input
The second step of the renormalization process is to identify dependencies among at-
tributes in the generic physical schema. Our algorithm uses two types of dependencies:
functional dependencies (FDs) and inclusion dependencies (INDs). These two forms of
dependencies are easy to express and are commonly used in database design [11].
For input to our algorithm, we require that all INDs are superkey-based. That is,
for an IND R(A)S(B),Bmust be a superkey of S. We do not believe that this
is a significant restriction since we intend for INDs to be used to indicate foreign key
relationships which exist in the denormalized data. Indeed, Mannila and R¨
aih¨
a [11]
have previously argued that only key-based dependencies are relevant to logical design.
5 Normalization Algorithm
Levene and Vincent [9] define a normal form for database relations involving FDs and
INDs referred to as inclusion dependency normal form (IDNF). They have shown that
Data: A set of relations R, FDs F, and INDs I
Result: A normalized set of relations R000 and new dependencies F0and I+000
begin
F0,I+ Expand(F,I);// Perform dependency inference
R0
,I+0
BCNFDecompose(R,F0,I+);// BCNF normalization
R00
,I+00
Fold(R0
,F0
,I+0
);// Remove attributes/relations
R000,I+000
BreakCycles(R00
,I+00
);// Break circular INDs
end
Fig. 2: Algorithm for normalization to interaction-free IDNF
normalizing according to IDNF removes redundancy from a database design implied
by the set of dependencies. However, one of the necessary conditions for this normal
form is that the set of INDs is non-circular. This excludes useful schemas which ex-
press constraints such as one-to-one foreign key integrity. For example, for the relations
R(A,B)and S(B,C)we can think of the circular INDs R(A)=S(B)as expressing
a one-to-one foreign key between R(A)and S(B).
Levene and Vincent also propose an extension to IDNF, termed interaction-free
inclusion dependency normal form which allows such circularities. The goal of our
normalization algorithm is to produce a schema that is in interaction-free IDNF. This
normal form avoids redundancy implied by FDs and INDs while still allowing the ex-
pression of useful information such as foreign keys. As we show in Section 6, this
produces useful logical models for a real-world example.
Figure 2 provides an overview of our normalization algorithm, which consists of
four stages. In the reminder of this section, we discuss the normalization algorithm
in more detail. We will make use of a running example based on the simple generic
(denormalized) physical schema and dependencies shown in Figure 3.
5.1 Dependency Inference
To minimize the effort required to provide input needed to create a useful normalized
schema, we aim to infer dependencies whenever possible. Armstrong [1] provides a
well-known set of axioms which can be used to infer FDs from those provided as input.
Similarly, Mitchell [15] presents a similar set of inference rules for INDs.
Mitchell further presents a set of inference rules for joint application to a set of FDs
and INDs. We adopt Mitchell’s rules to infer new FDs for INDs and vice versa. The
pullback rule enables new FDs to be inferred from FDs and INDs. The collection rule
allows the inference of new INDs. These new dependencies allow the elimination of
attributes and relations via the Fold algorithm (see Section 5.3) to reduce the size of
the resulting schema while maintaining the same semantic information.
There is no complete axiomatization for FDs and INDs taken together [3]. Our
Expand procedure, which uses Mitchell’s pullback and collection rules for inference
from FDs and INDs, is sound but incomplete. However, it does terminate, since the
universe of dependencies is finite and the inference process is purely additive. Although
Physical Schema
EmpProjects(EmpID,EmpName,ProjID,ProjName)
Employees(EmpID,EmpName,DeptID,DeptName)
Managers(DeptID,EmpID)
Functional Dependencies
EmpProjects :EmpID !EmpName Employees :EmpID !EmpName,DeptID
EmpProjects :ProjID !ProjName Employees :DeptID !DeptName
Managers :DeptID !EmpID
Inclusion Dependencies
EmpProjects (EmpID,EmpName)Employees (...)
Employees (DeptID)Managers (...)
Managers (EmpID)Employees (...)
When attributes have the same names, we use ...on the right.
Fig. 3: Example generic physical schema and dependencies.
Expand may fail to infer some dependencies that are implied by the given set of FDs
and INDs, it is nonetheless able to infer dependencies that are useful for schema design.
5.2 BCNF Decomposition
The second step, BCNFDecompose, is to perform a lossless join BCNF decomposition
of the physical schema using the expanded set of FDs. When relations are decomposed,
we project the FDs and INDs from the original relation to each of the relations resulting
from decomposition. In addition, we add new INDs which represent the correspondence
of attributes between the decomposed relations. For example, when performing the de-
composition R(ABC )!R0(AB),R
00 (BC)we also add the INDs R0(B)R00 (B)
and R00 (B)R0(B). In our running example, we are left with the relations and de-
pendencies shown in Figure 4 after the Expand and BCNFDecompose steps.
5.3 Folding
Casanova and de Sa term the technique of removing redundant relations folding [2].
A complete description of our algorithm, Fold, is given in an extended technical re-
port [14]. Fold identifies attributes or relations which are recoverable from other rela-
tions. Specifically, folding removes attributes which can be recovered by joining with
another relation and relations which are redundant because they are simply a projection
of other relations. Fold also identifies opportunities for merging relations sharing a
common key.
Consider the EmpProjects0relation which contains the EmpName attribute. Since
we have the IND EmpProjects’(EmpID,EmpName) Employees(...)and
Physical Schema
Employees EmpID,EmpName,DeptIDDepartments DeptID,DeptName
EmpProjects EmpID,ProjIDEmpProjects0EmpID,EmpName
Managers DeptID,EmpIDProjects ProjID,ProjName
Functional Dependencies
Employees :EmpID !EmpName,DeptID Departments :DeptID !DeptName
Projects :ProjID !ProjName Managers :DeptID !EmpID
EmpProjects0:EmpID !EmpName
Inclusion Dependencies
Projects (ProjID)=EmpProjects (...)EmpProjects (EmpID)Employees (...)
EmpProjects0(EmpID)=EmpProjects (...)Managers (DeptID)Departments (...)
EmpProjects0(EmpID,EmpName)Employees (...)Managers Employees (...)
Fig. 4: Relations and dependencies after BCNF decomposition. Note that =is used to
represent bidirectional inclusion dependencies.
the FD Employees: EmpID !EmpName we can infer that the EmpName at-
tribute in EmpProjects0is redundant since it can be recovered by joining with the
Employees relation.
5.4 Breaking IND Cycles
Interaction-free IDNF requires that the final schema be free of circular INDs. Mannila
and R¨
aih¨
a [11] use a technique, which we call BreakCycles, to break circular INDs
when performing logical database design. We adopt this technique to break IND cycles
which are not proper circular.
5.5 IDNF
The goal of our normalization algorithm is to produce a schema that is in interaction-
free IDNF with respect to the given dependencies. The following conditions are suf-
ficient to ensure that a set of relations Ris in interaction-free IDNF with respect to
a set of FDs Fand INDs I: (1) Ris in BCNF [5] with respect to F, (2) all the INDs
in Iare key-based or proper circular, and (3) Fand Ido not interact. A set of INDs
is proper circular if for each circular inclusion dependency over a unique set of re-
lations R1(X1)R2(Y2),R
2(X2)R3(Y3),...,R
m(Xm)R1(Y1), we have
Xi=Yifor all i. The schema produced by the normalization algorithm of Figure 2 is
in interaction-free IDNF. We provide a proof in an extended technical report [14].
6 RUBiS Case Study
In previous work, we developed a tool called NoSE [13], which performs automated
schema design for NoSQL systems. We used NoSE to generate two Cassandra schemas,
each optimized for a different workload (a full description is given in an extended tech-
nical report [14]). In each case, NoSE starts with a conceptual model of the database
which includes six types of entities (e.g., users, and items) and relationships between
them. The two physical designs consist of 9 and 14 Cassandra column families.
Our case study uses NoSE’s denormalized schemas as input to our algorithm so
that we can compare the schemas that it produces with the original conceptual model.
For each physical schema, we tested our algorithm with two sets of dependencies: one
manually generated from the physical schema, and a second mined from an instance of
that schema using techniques discussed in an extended technical report [14]. The first
set of dependencies resulted in a conceptual model that was identical (aside from names
of relations and attributes) to the original conceptual model used by NoSE, as desired.
For the second set of tests, renormalization produced the original model for the
smaller Cassandra schema. The mining process identified 61 FDs and 314 INDs. Rank-
ing heuristics were critical to this success. Without them, spurious dependencies lead to
undesirable entities in the output schema. For the larger schema, mining found 86 FDs
and 600 INDs, many of them spurious, resulting in a model different from the original.
No set of heuristics will be successful in all cases and this is an area for future work.
These examples show that FDs and INDs are able to drive meaningful denormaliza-
tion. Runtime for the normalization step of our algorithm was less than one second on
a modest desktop workstation in all cases.
7 Related Work
Much of the existing work in normalization revolves around eliminating redundancy in
relational tables based on different forms of dependencies. However, it does not deal
with the case where applications duplicate data across relations. Inclusion dependen-
cies are a natural way to express this duplication. Other researchers have established
normal forms using inclusion dependencies [11, 10, 9] in addition to FDs. Our approach
borrows from Manilla and R¨
aih¨
a, who present a variant of a normal form involving in-
clusion dependencies and an interactive normalization algorithm. However, it does not
produce useful schemas in the presence of heavily denormalized data. Specifically, their
approach is not able to eliminate all data duplicated in different relations.
A related set of work exists in database reverse engineering (DBRE). The goal of
DBRE is to produce an understanding of the semantics of a database instance, com-
monly through the construction of a higher level model of the data. Unlike our work,
many approaches [4, 17] present only an informal process and not a specific algorithm.
There is significant existing work in automatically mining both functional [12] and
inclusion [8] dependencies from both database instances and queries. These approaches
complement our techniques since we can provide the mined dependencies as input into
our algorithm. Papenbrock and Naumann [16] present of heuristics for making use of
mined dependencies to normalize a schema according to BCNF. We leverage these to
incorporate mining into our algorithm as discussed in an extended technical report [14].
8 Conclusions and Future Work
We have developed a methodology for transforming a denormalized physical schema in
a NoSQL datastore into a normalized logical schema. Our method makes use of func-
tional and inclusion dependencies to remove redundancies commonly found in NoSQL
database designs. We further showed how we can make use of dependencies which were
mined from a database instance to reduce the input required from users. Our method has
a variety of applications, such as enabling query execution against the logical schema
and guiding schema evolution as application requirements change.
References
1. Armstrong, W.W.: Dependency structures of data base relationships. In: IFIP Congress. pp.
580–583 (1974)
2. Casanova, M.A., de Sa, J.E.A.: Mapping uninterpreted schemes into entity-relationship dia-
grams: Two applications to conceptual schema design. IBM Journal of Research and Devel-
opment 28(1), 82–94 (Jan 1984)
3. Casanova, M.A., et al.: Inclusion dependencies and their interaction with functional depen-
dencies. Journal of Computer and System Sciences 28(1), 29–59 (1984)
4. Chiang, R.H., et al.: Reverse engineering of relational databases: Extraction of an EER model
from a relational database. Data & Knowledge Engineering 12(2), 107–142 (Mar 1994)
5. Codd, E.F.: Recent investigations into relational data base systems. Tech. Rep. RJ1385, IBM
(Apr 1974)
6. G´
omez, P., Casallas, R., Roncancio, C.: Data schema does matter, even in NoSQL systems!
In: RCIS 2016 (June 2016)
7. Izquierdo, J.L.C., Cabot, J.: Discovering Implicit Schemas in JSON Data, p. 6883. Springer,
Berlin, Heidelberg (Jul 2013)
8. Kantola, M., et al.: Discovering functional and inclusion dependencies in relational
databases. Intl. Journal of Intelligent Systems 7(7), 591–607 (1992)
9. Levene, M., Vincent, M.W.: Justification for inclusion dependency normal form. IEEE
TKDE 12(2), 281–291 (Mar 2000)
10. Ling, T.W., Goh, C.H.: Logical database design with inclusion dependencies. In: ICDE ’92.
pp. 642–649 (Feb 1992)
11. Mannila, H., R¨
aih¨
a, K.J.: Inclusion dependencies in database design, pp. 713–718. IEEE
Computer Society (Feb 1986)
12. Mannila, H., R¨
aih¨
a, K.J.: Algorithms for inferring functional dependencies from relations.
DKE 12(1), 83–99 (1994)
13. Mior, M.J., Salem, K., Aboulnaga, A., Liu, R.: NoSE: Schema design for NoSQL applica-
tions. In: ICDE 2016. pp. 181–192 (May 2016)
14. Mior, M.J., Salem, K.: Renormalization of NoSQL database schemas. Tech. Rep. CS-2017-
02, University of Waterloo (2017)
15. Mitchell, J.C.: Inference Rules for Functional and Inclusion Dependencies, pp. 58–69. PODS
’83, ACM (1983)
16. Papenbrock, T., Naumann, F.: Data-driven schema normalization. In: Proceedings of EDBT
2017. pp. 342–353 (2017)
17. Premerlani, W.J., Blaha, M.R.: An approach for reverse engineering of relational databases.
Commun. ACM 37(5), 42–49 (May 1994)
... We have studied the evolution of the schema of two open-source projects using Apache Cassandra to store their data [6]: Minds 1 and PowSybl 2 . Minds is a social network with 35 schema versions. ...
... PowSybl is a framework for real and simulated power systems which also has 35 schema versions. In previous work [6], we focused on issues caused by an incorrect evolution of the database due to lack of consideration of additional information contained in the conceptual model, such as where the data are duplicated. One of the identified changes that is prone to cause these issues is the addition of a new column. ...
... Stage 1 aims to maintain inter-model consistency after a change to the schema by performing the required updates to the conceptual model to reflect the change in the schema. We propose using existing work that studies the renormalization of the schema in a normalized conceptual model [6,7], which generate a normalized conceptual model based on a column-oriented schema. We use past work focused on column-oriented databases [6] as a guideline for determining the specific changes to be performed in the conceptual model for each change in the schema. ...
Chapter
Schema design for NoSQL column-oriented database applications follows a query-driven strategy where each table satisfies a query that will be executed by the client application. This strategy usually implies that the schema is denormalized as the same information can be queried several times in different ways, leading to data duplication in the database. Because the schema does not provide information such as where the data is duplicated or the relationships between conceptual entities, developers must use additional information when evolving the database. One strategy for accessing this information is to use a conceptual model that must be synchronized and kept consistent with the physical schema. In this work, we propose evolving a column-oriented database application after a schema change with a combination of methods that consists of four sequential stages: 1) reflect the schema change in the conceptual model, 2) take the necessary actions in the schema to maintain consistency between the new conceptual model and the schema, 3) maintain data integrity through migration of data and 4) update and adapt the client application to the new schema.
Article
NoSQL stores have become ubiquitous since they offer a new cost-effective and schema-free system. Although NoSQL systems are widely accepted today, Business Intelligence & Analytics (BI&A) wields relational data sources. Exploiting schema-free data for analytical purposes is a challenge since it requires reviewing all the BI&A phases, particularly the Extract-Transform-Load (ETL) process, to fit big data sources as document stores. In the ETL process, the join of several collections, with a lack of explicitly known join fields is a significant dare. Detecting these fields manually is time and effort-consuming and infeasible in large-scale datasets. In this paper, we study the problem of discovering join fields automatically. We introduce an algorithm that aims to automatically detect both identifiers and references on several document stores. The modus operandi of our approach underscores three core stages: (i) global schema extraction; (ii) discovery of candidate identifiers; and (iii) identifying candidate pairs of identifier and reference fields. We use scoring features and pruning rules to discover true candidate identifiers from many initial ones efficiently. To find candidate pairs between several document stores, we put into practice node2vec as a graph embedding technique, which yields significant advantages while using syntactic and semantic similarity measures for pruning pointless candidates. Finally, we report our experimental findings that show encouraging results.
Conference Paper
Data modeling in Cassandra databases follows a query-driven approach where each table is created to satisfy a query, leading to repeated data as the Cassandra model is not normalized by design. Consequently, developers bear the responsibility to maintain the data integrity at the application level, as opposed to when the model is normalized. This is done by embedding in the client application the appropriate statements to perform data changes, which is error prone. Cassandra data modeling methodologies have emerged to cope with this problem by proposing the use of a conceptual model to generate the logical model, solving the data modeling problem but not the data integrity one. In this thesis we address the problem of the integrity of these data by proposing a method that, given a data change at either the conceptual or the logical level, determines the executable statements that should be issued to preserve the data integrity. Additionally, as this integrity may also be lost as a consequence of creating new data structures in the logical model, we complement our method to preserve data integrity in these scenarios. Furthermore, we address the creation of data structures at the conceptual level to represent a normalized version of newly created data structures in the logical model.
Conference Paper
Full-text available
Database design is critical for high performance in relational databases and many tools exist to aid application designers in selecting an appropriate schema. While the problem of schema optimization is also highly relevant for NoSQL databases, existing tools for relational databases are inadequate for this setting. Application designers wishing to use a NoSQL database instead rely on rules of thumb to select an appropriate schema. We present a system for recommending database schemas for NoSQL applications. Our cost-based approach uses a novel binary integer programming formulation to guide the mapping from the application's conceptual data model to a database schema. We implemented a prototype of this approach for the Cassan-dra extensible record store. Our prototype, the NoSQL Schema Evaluator (NoSE) is able to capture rules of thumb used by expert designers without explicitly encoding the rules. Automating the design process allows NoSE to produce efficient schemas and to examine more alternatives than would be possible with a manual rule-based approach.
Conference Paper
Full-text available
JSON has become a very popular lightweigth format for data exchange. JSON is human readable and easy for computers to parse and use. However, JSON is schemaless. Though this brings some benefits (e.g., flexibility in the representation of the data) it can become a problem when consuming and integrating data from different JSON services since developers need to be aware of the structure of the schemaless data. We believe that a mechanism to discover (and visualize) the implicit schema of the JSON data would largely facilitate the creation and usage of JSON services. For instance, this would help developers to understand the links between a set of services belonging to the same domain or API. In this sense, we propose a model-based approach to generate the underlying schema of a set of JSON documents.
Article
Full-text available
A method of mapping sets of uninterpreted record or relation schemes into entity-relationship diagrams is described and then applied to two conceptual design problems. First, the method is applied to the design of relational databases. It is shown that the method can be interpreted as a normalization procedure that maps a given relational schema into a new schema that represents an entity-relationship diagram. That is, the original schema has an interpretation in terms of higher-order concepts, which helps in understanding the semantics of the database it describes. The second design problem is related to the conversion of conventional file systems to the database approach. The method is used in this context to obtain a database conceptual schema from the description of the conventional system, which is one of the fundamental steps of the conversion process.
Article
We consider the problem of discovering the functional and inclusion dependencies that a given database instance satisfies. This technique is used in a database design tool that uses example databases to give feedback to the designer. If the examples show deficiencies in the design, the designer can directly modify the examples. the tool then infers new dependencies and the database schema can be modified, if necessary. the discovery of the functional and inclusion dependencies can also be used in analyzing an existing database. the problem of inferring functional dependencies has several connections to other topics in knowledge discovery and machine learning. In this article we discuss the use of examples in the design of databases, and give an overview of the complexity results and algorithms that have been developed for this problem. © 1992 John Wiley & Sons, Inc.
Article
A methodology for extracting an extended Entity-Relationship (EER) model from a relational database is presented. Through a combination of data schema and data instance analysis, an EER model is derived which is semantically richer and more comprehensible for maintenance and design purposes than the original database. Classification schemes for relations and attributes necessary for the EER model extraction are derived and justified. These have been demonstrated to be implementable in a knowledge-based system; a working prototype system which does so is briefly discussed. In addition, cases in which human input is required are also clearly identified. This research also illustrates that the database reverse engineering process can be implemented at a high level of automation.
Article
Inclusion dependencies, or INDs (which can say, for example, that every manager is an employee) are studied, including their interaction with functional dependencies, or FDs. A simple complete axiomatization for INDs is presented, and the decision problem for INDs is shown to be PSPACE-complete. (The decision problem for INDs is the problem of determining whether or not Σ logically implies a, given a set Σ of INDs and a single IND a). It is shown that finite implication (implication over databases with a finite number of tuples) is the same as unrestricted implications for INDs, although finite implication and unrestricted implication are distinct for FDs and INDs taken together. It is shown that, although there are simple complete axiomatizations for FDs alone and for INDs alone, there is no complete axiomatization for FDs and INDs taken together, in which every rule is k-mary for some fixed k (and in particular, there is no finite complete axiomatization.) Thus, no k-mary axiomatization can fully describe the interaction between FDs and INDs. This is true whether we consider finite implication or unrestricted implication. In the case of finite implication, this result holds, even if no relation scheme has more than two attributes, and if all of the dependencies are unary (a dependency is unary if the left-hand side and right-hand side each contain only one attribute). In the case of unrestricted implication, the result holds, even if no relation scheme has more than three attributes, each FD is unary, and each IND is binary.
Article
The dependency inference problem is to find a cover of the set of functional dependencies that hold in a given relation. The problem has applications in relational database design, in query optimization, and in artificial intelligence. The problem is exponential in the number of attributes. We develop two algorithms with better best case behavior than the simple one. One algorithm reduces the problem to computing the transversal of a hypergraph. The other is based on repeatedly sorting the relation with respect to a set of attributes.
Conference Paper
A set Σ of functional dependencies and inclusion dependencies implies a single dependency σ if all databases (finite and infinite) which satisfy Σ also satisfy σ. This paper presents complete inference rules for deducing implications of inclusion and functional dependencies. The results of [5] suggest that the implication problem for functional and inclusion dependencies together has no simple axiomatization satisfying a natural set of conditions. Out of necessity, the inference rules presented here do not satisfy the conditions assumed in [5].