Finding Semantic Mappings from Relational Tables to Ontologies/Conceptual Models
ABSTRACT Many problems in Information and Data Management re-quire a semantic account of a database schema. At its best, such an account consists of formulas expressing the relationship ("mapping") be-tween the schema and a formal conceptual model or ontology (CM) of the domain. In this paper we describe the underlying principles, algo-rithms, and a prototype of a tool which finds such semantic mappings from relational tables to ontologies, when given as input simple corre-spondences from columns of the tables to datatype properties of classes in an ontology. Although the algorithm presented is necessarily heuris-tic, we offer formal results showing that the answers returned by the tool are "correct" for relational schemas designed according to standard Entity-Relationship techniques. To evaluate its usefulness and effective-ness, we have applied the tool to a number of public domain schemas and ontologies. Our experience shows that significant effort is saved when us-ing it to build semantic mappings from relational tables to ontologies.
-
Citations (0)
-
Cited In (0)
Page 1
Finding Semantic Mappings from Relational Tables to
Conceptual Models/Ontologies
Yuan AnAlex Borgida John Mylopoulos
Computer Science
University of Toronto
Canada
yuana@cs.toronto.edu
Computer Science
Rutgers University
USA
borgida@cs.rutgers.edu
Computer Science
University of Toronto
Canada
jm@cs.toronto.edu
Abstract
Data integration is one of many problems
requiring a semantic account of a database
schema. At its best, such an account consists
of a formal mapping between the schema and a
formal conceptual model or ontology (CM) of
the domain. This paper describes the underly-
ing principles, algorithms, and a prototype of
a tool which proposes such semantic mappings
when given as input a relational schema, a
CM, and simple correspondences between ta-
ble columns and attributes of concepts in the
CM. Although the algorithm presented is nec-
essarily heuristic, we offer formal results stat-
ing that the answers returned are “correct” for
relational schemas designed according to stan-
dard Entity-Relationship techniques. We also
report on experience in using the tool with
public domain schemas and ontologies.
1
A number of important database problems have been
shown to have improved solution when one has avail-
able a representation of the precise semantics of the
database schema. These include view integration, fed-
erated databases, data warehousing [2], and especially
information integration through mediated schemas
[11]. (See survey [22].) Since much information on the
web is generated from databases (the “deep web”), the
recent call for a Semantic Web, which requires a con-
nection between web content and ontologies, provides
additional motivation for the problem of associating
semantics with data (e.g., [7]). In almost all of these
cases semantics of the data is captured by a conceptual
model or ontology (CM) of the problem domain, and
some kind of semantic mapping between the database
schema and the CM. Although sometimes the map-
ping is just a simple association from terms to terms,
in other cases [11, 2, 1] what is required is a complex
formula, often expressed in logic or a query language.
Introduction and Motivation
So far, it has been assumed that humans provide
both the CM, and the semantic mapping between the
CM and database schema. Unfortunately, this involves
two steps, both of which are time-consuming and error
prone: (i) building the CM; (ii) expressing the connec-
tion of the database schema to the CM as a mapping
formula. For the first problem, we suggest reusing for-
mal CMs that have already been developed. This view
is particularly appropriate for the semantic web, where
large ontologies, such as Cyc [15], are developed in-
dependently of any particular application. The same
is true for information integration/data warehousing,
where new databases may want to “join” after a CM
has already been developed for the domain of dis-
course. For the second problem, we propose to build
tools that assist users in finding complex mapping for-
mulas — the subject of this paper.
Setting: To be explicit, we assume a situation where
we are given a relational schema consisting of a set of
relational tables, and a formal, independently devel-
oped CM, which can be potentially large (e.g., thou-
sands of concepts). Our tool has users who are familiar
with the database schema whose semantic mapping is
being sought (both its schema and contents), but who
are unfamiliar with the details of the CM, and are
uncomfortable with writing complex logical formulas
defining mappings. Note that the CM is not assumed
to be exactly about the same domain of discourse as
the database: it could be much richer, as would be the
case with large ontologies.
Approach: The scenario we imagine is interactive,
and involves tool support for several phases: (1) find-
ing and expressing simple correspondences between re-
lational table columns and CM components (mostly
string/integer-valued attributes), which supply data
to be stored in those columns; (2) providing a list of
candidate formulas (expressed as conjunctive queries)
that express the mappings between the table and the
CM; (3) if necessary, supporting the user in inspecting,
pruning and possibly editing these candidate mappings
to refine them.
Page 2
-hasSsn?
-hasName?
-hasAddress?
-hasAge?
Employee?
-hasDeptNumber?
-hasName?
-.?
-.?
Department?
-hasNumber?
-hasName?
-.?
-.?
Worksite?
works_for?
manages?
controls?
4..*?1..1?
0..1?
0..*?
1..*?
1..1?
1..1?
supervision?
works_on?
0..1?
0..1?
0..*?
Figure 1: Company CM.
Our approach is directly inspired by the Clio project
[14, 17], which developed a successful tool that in-
fers mappings from one set of relational tables and/or
XML documents to another, given just a set of corre-
spondences between their respective attributes. As in
[17], we focus here on step (2) of the process described
above, since there is considerable existing work con-
cerning the first step, in the area of schema match-
ing [19]. Additional evidence for the importance of
supporting steps other than just (1) is provided by
the results of a survey concerning the distribution of
time/effort in solving data integration problems [21].
The following example illustrates the input/output be-
havior of the tool we seek:
Example 1.1 Given the CM in Figure 1 (expressed
here in UML notation), relational table Emp(ssn,
name, dept, proj) with key ssn, and the (natural) cor-
respondences
T : Emp.ssn?O : Employee.hasSsn
T : Emp.name?O : Employee.hasName
T : Emp.dept?O : Department.hasDeptNumber
T : Emp.proj?O : Worksite.hasNumber
we may expect a mapping of the form
T :Emp(ssn, name, dept, proj):-
O:Employee(x1), O:hasSsn(x1,ssn),
O:hasName(x1,name), O:Department(x2),
O:works for(x1,x2), O:hasDeptNumber(x2,dept),
O:Worksite(x3), O:works on(x1,x3),
O:hasNumber(x3,proj).
where, for clarity, we used here prefixes T and O to
distinguish predicates in the relational schema and the
CM. ?
Note that, as mentioned, the problem is inherently
under-specified: in the example above, the table could
just as well have represented the manages relationship,
instead of works for, so we currently propose to return
all such semantic mappings. Without going into fur-
ther details at this point, we summarize the contribu-
tions which we feel are being made here:
• The paper identifies a new version of the data
mapping problem: that of discovering formulas
expressing the semantic mapping between rela-
tional database schemas and formal CMs, en-
coded as graphs. (Exactly such mapping formulas
are used, for example, in the DWQ system [2].)
• It then proposes a (necessarily heuristic) solu-
tion to this problem, whose most basic heuristic
is a version of Occam’s razor: finding something
close to a minimal connection between the con-
cepts that have attributes corresponding to ta-
ble columns. Formally, these connections are so-
called Steiner trees.
• The algorithm is enhanced to take into account in-
formation about the schema (key and foreign key
structure), the ontology (cardinality restrictions),
and standard database schema design guidelines.
• To gain empirical confidence, the proposed algo-
rithms have been implemented, and evaluated in
a number of experiments.
• To gain theoretical confidence, we describe for-
mal results which state that if the schema was
designed from a CM using techniques well-known
in the Entity Relationship literature (which pro-
vide a natural semantic mapping for each table),
then the tool will report essentially all and only
the appropriate semantics. This shows that our
heuristics are not just shots in the dark: in the
case when the CM has no extraneous material,
and when a table’s schema has not been denor-
malized, the algorithm will produce reasonable re-
sults.
The rest of the paper is structured as follows. Sec-
tion 2 discusses related work, and Section 3 presents
the necessary background and notation. Section 4 de-
scribes an intuitive progression of ideas underlying our
approach, while Section 5 provides the mapping dis-
covery algorithm. In Section 6 we report on the pro-
totype implementation of these ideas and experience
with the prototype. Finally, Section 7 summarizes the
contributions of this work and discusses directions for
future research.
2Related Work
The problem of discovering semantic connections dates
as far back as Ross Quillian’s PhD thesis [18], where
a program, given noun-noun pairs, finds semantic
paths between them within a semantic network. Such
paths were used to generate English descriptions of the
meaning for each pair.
The problem of data reverse engineering is to ex-
tract an ER diagram, for example, from a database
schema. Sophisticated algorithms and approaches to
this have appeared in the literature over the years (e.g.,
[12, 6]). The major difference between data reverse en-
gineering and our work is that we are given an existing
CM, and want to interpret a legacy relational schema
in terms of it, whereas data reverse engineering aims
to construct a new CM.
Schema matching (e.g., [19]) identifies semantic re-
lations between schema elements based on their names,
data types, constraints, and schema structures. Re-
cent work on iMAP [4], discovers not only 1-1 matches
between pairs of elements, but also complex matches
Page 3
that specify how some combinations (e.g., concate-
nation) of elements correspond in the two schemas.
iMAP concentrates on finding sophisticated formulas
involving mostly single tables, though its multi-modal
algorithm examines such hints as stored queries and
extensional data to also discover join paths between
tables. The present work can be thought of as focus-
ing exclusively on finding such join paths, but based on
table structure and semantic information in the CM,
since neither queries nor instances are normally avail-
able for ontologies.
Relationship to Clio.As mentioned earlier, the
Clio tool [14, 17] discovers formal queries describ-
ing how target schemas can be populated with data
from source schemas.The present work could be
viewed as extending this to the case when the source
schema is a relational database, while the target is a
CM. If one viewed the conceptual model as a rela-
tional schema made of unary and binary tables (e.g.,
O:Employee(X), O:hasSsn(X,ssn)), one could in fact
try to apply directly the Clio algorithm to Example
1.1, pushing it beyond its intended application do-
main. The desired mapping formula from Example
1.1 would not be produced for several reasons: (i) Clio
[17] does not make a so-called logical table connect-
ing O:hasSsn and O:hasDeptNumber, since the chase
algorithm only follows foreign key references out of ta-
bles. (ii) The fact that ssn is a key by itself, leads
us to prefer (see Section 4) a many-to-one relation-
ship, such as works for, over some many-to-many re-
lationship O:previouslyWorkedFor, which could have
been part of the CM; Clio does not differentiate the
two. So the work to be presented here analyzes the
key structure of the tables and the semantics of rela-
tionships (cardinality, IsA) to eliminate unreasonable
options that arise in mapping to CMs.
3 Formal Preliminaries
We assume the reader is familiar with standard no-
tions of relational databases, as presented in [20] for
example. We will use the notation T[K,Y ] to rep-
resent a relational table T with columns KY , and
key K. (When we view tables as predicates, we will
sort the arguments in alphabetical order.) If neces-
sary, we will refer to the individual columns in Y us-
ing Y [1],Y [2],..., and use XY as concatenation. Our
notational convention is that single column names are
either indexed or appear in lower-case. Given a table
such as T above, we use the notation key(T), nonkey(T)
and columns(T) to refer to K, Y and KY respectively.
(Note that we use the terms “table” and “column”
when talking about relational schemas, reserving “re-
lation(ship)” and “attribute” for aspects of the CM.)
A foreign key (fk) in T is a set of columns F that ref-
erences table T?, and imposes a constraint that the
projection of T on F is a subset of the projection of
T?on key(T?).
We use a generic conceptual modeling language
(CML), which contains common aspects of most se-
mantic data models, UML, ontology languages such
as OWL [16], and description logics.
the language allows the representation of concepts
(unary predicates over individuals), object proper-
ties/relationships (binary predicates relating individ-
uals), and datatype properties/attributes (binary pred-
icates relating individuals with values such as integers
and strings); attributes are single valued in this paper.
Concepts are organized in the familiar is-a hierarchy.
Object properties, and their inverses (which are always
present), are subject to constraints such as specifica-
tion of domain and range, plus the familiar cardinality
constraints, which here allow 1 as lower bounds (called
total relationships), and 1 as upper bounds (called
functional relationships). We shall represent a given
CM using a directed ontology graph, which has con-
cept nodes labeled with concept names C, and edges
labeled with object properties p; for each such p, there
is an edge for the inverse relationship, referred to as
p−. For each attribute f of concept C, we create a sep-
arate attribute node Nf,C, whose label is f, and with
edge labeled f from C to Nf,C.1In figures, we follow
UML, drawing concepts as rectangles, with attributes
listed inside.
We propose to have edge p from C to B, written
in the text as C ---p--- B, to represent that p has
domain C and range B. (If the relationship p is func-
tional, we write C ---p->-- B.) For expressive CMLs
such as OWL, we may also connect C to B by p if
we find an existential restriction stating that each in-
stance of C is related to some instance of B by p. The
significant feature of the resulting graph is that if C
and B (or their super-classes) are not linked by edge
p, then a ’join’ formula A(x),p(x,y),B(y) will not be
satisfied for any x,y, and hence should not appear in
any semantic mapping.
In this paper, a correspondence T.c ?E.f will re-
late column c of table T to attribute f of concept
E. Since our algorithms deal with ontology graphs,
formally a correspondence L will be a mathemati-
cal relation L(T,c,E,f,Nf,E), where the first two ar-
guments determine unique values for the last three.
Given L, T, and c, we will frequently want to find E;
for this, we use the derived function oncL(c,T), and
since the context will always identify L and T, we will
use the expression onc(c). We extend onc to sets of
columns in a point-wise manner. Therefore the set
M=onc(columns(T1)) is the set of entities in the CM
such that each has some attributes corresponding to
a column of T1 according to the correspondences im-
plicit in the context. Finally, a semantic mapping has
the form of a Horn-clause T(X) : −ψ(X,Y ), where T
is a table with columns X (which become arguments
Specifically,
1Unless ambiguity arises, we will use “node C”, when we
mean “concept node labeled C”.
Page 4
to its predicate), and ψ is a conjunctive formula over
predicates representing the CM, with Y existentially
quantified as usual.
Using the terminology of [11, 10], and following the
example of [11, 2], we are looking for a Local-as-View
(LAV) mapping connecting the schema with the CM,
partly because our tool is particularly well suited to
the incremental connection of databases to conceptual
models, which is associated with the LAV approach.
In the future, we plan to also explore the discovery of
GAV and GLAV mappings.
4
As mentioned in Section 2, Quillian [18] was searching
for semantic connections between word senses/nodes
in a semantic network. His solution relied on find-
ing shortest paths between them. Similarly, we are
starting from the concepts/nodes in the set M =
{onc(c)|c ∈ columns(T)}, and are trying to discover
semantic connections between them, so it makes sense
to look for “shortest” connections between the nodes
in M. A spanning tree SM for M in the graph of the
CM captures this notion, in the sense that it uses the
minimal number of edges necessary to connect M, and
does not introduce any extraneous concepts.
Thus,for Example 1.1,
as onespanning tree,
Employee ---works_for-->- Department,
Employee ---works_on-->- Worksite,
forthe attributes inthe
as
Employee ---hasSsn-->- NhasSsn,E,
NhasSsn,Eis an attribute node labeled with hasSsn.
A spanning tree S for table T[Y ], based on cor-
respondence L, gives rise to a conjunctive formula
according to the procedure encodeTree(S,L) below,
which basically assigns variables to nodes, and con-
nects them using edge labels as predicates:
Function encodeTree(S,L)
input: subtree S of ontology graph, correspondences L
from table columns to attributes of concept nodes in S.
output: variable name generated for root of S, and con-
junctive formula for the tree.
steps:
1. Suppose N is the root of S. Let Ψ = {}.
2. if N is an attribute node with label f,
find d such that L( ,d, ,f,N) = true, return
(d,true). /*for leafs use corresp. column names*/
3. if N is a concept node with label C,
then introduce new variable x; add conjunct
C(x) to Ψ;
for each edge pi from N to Ni
let Si be the subtree rooted at Ni,
let (vi,φi(Zi))= encodeTree(Si,L),
add conjuncts pi(x,vi) ∧ φi(Zi) to Ψ;
return (x,Ψ).
Continuing with the example above,
encodeTree(S1
x3
toEmployee,Department,
Principles of Mapping Discovery
we would get S1
where
S1
emp
emp
has edges
plus edges
correspondence, such
where
suppose
emp,L) assigns variables x1, x2, and
andWorksite,
respectively;
hand side of the Horn-clause in Example 1.1,
where,forexample,
works for(x1,
resentstheedge
Employee ---works_for-->-
Department, while hasSsn(x1,ssn) represents the
edge Employee ---hasSsn-->- NhasSsn,E,
of the correspondence Emp.ssn?Employee.hasSsn.
As noted before, there are alternate semantics, based
on different spanning trees, which often arise due to
multiple edges between concepts.
To reduce the complexity of the algorithms, and
the size of the answer set, we modify the graph by col-
lapsing multiple edges between nodes E and F, labeled
p1,p2,... say, into a single edge labeled?p1;p2;...?The
idea is that it will be up to the user to choose between
the alternative labels in phase (3) of the scenario de-
scribed in the Section 1, though the system may offer
suggestions, based on additional information, such as
heuristics concerning the identifiers labeling tables and
columns, and their relationship to property names.
Consider however the case when T[c,b] is a table
with key c, corresponding to attribute c on C, and b
is a foreign key corresponding to b on B, where b and
c are unique identifier attributes within their classes.
Then for each value of c (and hence instance of C),
T associates at most one value of b (instance of B).
Hence the semantic mapping for T should be some
formula that acts as a function from its first to its sec-
ond argument. Trees for such formulas look like func-
tional edges, and hence should be preferred. For ex-
ample, given table Dep[dept,ssn,...], and correspon-
dences like in Example 1.1, the proper spanning tree
uses manages−rather than works for−.
Conversely, for table T?[c,b], an edge that is func-
tional from C to B, or from B to C, is likely not
to reflect a proper semantics since it would mean
that the key chosen for T?is actually a super-key –
an unlikely error. (In our example, consider a table
T[ssn,dept], where both columns are foreign keys.).
To deal with such problems, our algorithm will work in
two stages: first connecting the concepts correspond-
ing to key columns into a skeleton tree, then connecting
the remaining columns to the skeleton by, preferably,
functional edges.
Finally, we must deal with our assumption that the
CM is developed independently, which implies that not
all parts of it are reflected in the database schema.
This complicates things, since in building the span-
ning tree we may need to go through additional nodes,
which end up not being present in the database. For
example, consider the table Project (name, supervi-
sor, ...), with name as key, and correspondences
T : Project.name ? O : Worksite.hasName, and
T : Project.supervisor ? O : Employee.hasSsn,
to the CM in Figure 1. Based on the argument above,
the edge works on−, connecting Worksite to Employee
is inappropriate because it is not functional. Instead,
it willthengenerate the right-
x2) rep-
because
Page 5
we prefer the functional path
passing through node Department, even if it is longer.
We therefore modify our algorithm to look for so-
called Steiner-trees
ordinary spanning trees.
ulo other considerations such as key and functional
paths.)Similar situations arise when the CM con-
tains detailed aggregation hierarchies (e.g., city part-
of township part-of county part-of state), which are
abstracted in the database (e.g., a table with columns
for city and state only).
We have chosen to flesh out the above principles
in a systematic manner by considering the behavior of
our proposed algorithm on relational schemas designed
from Entity Relationship diagrams — a topic widely
covered in even undergraduate database courses [20].
(We call this er2rel schema design.) One benefit of
this approach will be to allow us to prove that our al-
gorithm, though heuristic in general, is in some sense
“correct” for a certain class of schemas. Of course, in
practice such schemas may be “denormalized” in order
to improve efficiency, and, as we mentioned, only parts
of the CM are realized in the database. We emphasize
that our algorithm uses the general principles enun-
ciated above even in such cases, with relatively good
results in practice.
controls−.manages−,
2(s-trees for short), instead of
(This is, of course, mod-
5
We will introduce the details of the algorithm in a
gradual manner, by repeatedly adding features of an
ER model that appear as part of the CM. We assume
that the reader is familiar with basics of ER modeling
and database design [20], though we summarize the
ideas.
Mapping Discovery Algorithms
5.1An Initial Subset of ER notions
We start with a subset of ER that contains the notions
such as entity set (called just “entity” here), with at-
tributes, and binary relationship set. In order to facil-
itate the statement of correspondences and theorems,
we assume in this section that attributes in the CM
have globally unique names. (Our implemented tool
does not make this assumption.) An entity is repre-
sented as a concept in our CM. A binary relationship
set corresponds to two relationships in our CM, one
for each direction, though only one is mapped to a ta-
ble. Such a relationship will be called many-many if
neither it nor its inverse is functional. A strong en-
tity S has some attributes that act as identifier. We
shall refer to these using unique(S) when describing the
rules of schema design. A weak entity W has instead
2A Steiner-tree for set M of nodes in graph G is a minimum-
weight subgraph of G that contains all the nodes in M and is
a tree. In the case when M is large, computing an s-trees may
become too expensive since the problem is NP-hard. However,
we will only compute s-trees for the concepts corresponding to
columns in keys, and key cardinality is small in practice.
localUnique(W) attributes, plus a functional total bi-
nary relationship p (preferred to with idRel(W)) to an
identifying owner entity (preferred to with idOwn(W)).
Note that information about general identification
cannot be represented in even highly expressive lan-
guages such as OWL. So functions like unique are only
used while describing the er2rel mapping, and are not
assumed to be available during semantic recovery. The
er2rel design methodology (we follow mostly [12, 20])
is defined by two components: To begin with, Table 1
specifies a mapping τ(O) returning a relational table
schema for every CM component O, where O is either
a concept/entity or a binary relationship. In this sub-
section, we assume that no pair of concepts is related
by more than one relationship, and that there are no
so-called “recursive” relationships relating an entity to
itself. (We deal with these in Section 5.3.)
ER Model object O
Strong Entity S
Relational Table τ(O)
columns:
X
primary key:
K
Let X=attribs(S)
Let K=unique(S)
fk’s:
none
anchor:
S
semantics:
T(X):-S(y),hasAttribs(y,X).
identifier:
identifyS(y,K):-S(y),hasAttribs(y,K).
columns:
Weak Entity W
let
Z=attribs(W)
X = key(τ(idOwn(W)))
U =localUnique(W)
E = idOwn(W)
V = Z − U
ZX
primary key:
UX
fk’s:
X
W
anchor:
semantics:
T(X,U,V ):-
W(y), hasAttribs(y,Z),E(w),
idrel(W)(y,w),identifyE(w,X).
identifier:
identifyW(y,XU):-
W(y),E(w),idrel(W)(y,w),
hasAttribs(y,U),identifyE(w,X).
columns:
Functional Relationship F
E1–F->-E2
X1X2
primary key:
X1
fk’s:
Xi references τ(Ei),
let Xi= key(τ(Ei))
for i = 1,2
anchor:
E1
semantics:
T(X1,X2):-
E1(y1),identifyE1(y1,X1), F(y1,y2),
E2(y2), identifyE2(y2,X2).
columns:
Many-many
Relationship M
E1–M–E2
let Xi= key(τ(Ei))
for i = 1,2
X1X2
primary key:
X1X2
fk’s:
Xi references τ(Ei),
semantics:
T(X1,X2):-
E1(y1),identifyE1(y1,X1), M(y1,y2),
E2(y2), identifyE2(y2,X2).
Table 1: er2rel Design Mapping.