Discovering the Semantics of Relational Tables Through Mappings
ABSTRACT Many problems in Information and Data Management require a semantic account of a database schema. At its best, such an account
consists of formulas expressing the relationship (“mapping”) between the schema and a formal conceptual model or ontology
(CM) of the domain. In this paper we describe the underlying principles, algorithms, and a prototype tool that finds such
semantic mappings from relational tables to ontologies, when given as input simple correspondences from columns of the tables to datatype properties of classes in an ontology. Although the algorithm presented is necessarily
heuristic, we offer formal results showing that the answers returned by the tool are “correct” for relational schemas designed
according to standard Entity-Relationship techniques. To evaluate its usefulness and effectiveness, we have applied the tool
to a number of public domain schemas and ontologies. Our experience shows that significant effort is saved when using it to
build semantic mappings from relational tables to ontologies.
Conference Proceeding: Discovering Semantically Similar Associations (SeSA) for Complex Mappings between Conceptual Models.Conceptual Modeling - ER 2008, 27th International Conference on Conceptual Modeling, Barcelona, Spain, October 20-24, 2008. Proceedings; 01/2008
Discovering the Semantics of Relational Tables
Yuan An1, Alex Borgida2, and John Mylopoulos1
1Department of Computer Science, University of Toronto, Canada
2Department of Computer Science, Rutgers University, USA
Abstract. Many problems in Information and Data Management require a se-
mantic account of a database schema. At its best, such an account consists of
formulas expressing the relationship (“mapping”) between the schema and a for-
mal conceptual model or ontology (CM) of the domain. In this paper we describe
the underlying principles, algorithms, and a prototype tool that finds such se-
mantic mappings from relational tables to ontologies, when given as input simple
correspondences from columns of the tables to datatype properties of classes in
an ontology. Although the algorithm presented is necessarily heuristic, we offer
formal results showing that the answers returned by the tool are “correct” for re-
lational schemas designed according to standard Entity-Relationship techniques.
To evaluate its usefulness and effectiveness, we have applied the tool to a number
of public domain schemas and ontologies. Our experience shows that significant
effort is saved when using it to build semantic mappings from relational tables to
Keywords: Semantics, ontologies, mappings, semantic interoperability.
1 Introduction and Motivation
A number of important database problems have been shown to have improved solutions
by using a conceptual model or an ontology (CM) to provide precise semantics for a
database schema. These3include federated databases, data warehousing , and infor-
mation integration through mediated schemas [13,8]. Since much information on the
web is generated from databases (the “deep web”), the recent call for a Semantic Web,
which requires a connection between web content and ontologies, provides additional
motivation for the problem of associating semantics with database-resident data (e.g.,
). In almost all of these cases, semantics of the data is captured by some kind of
semantic mapping between the database schema and the CM. Although sometimes the
mapping is just a simple association from terms to terms, in other cases what is required
is a complex formula, often expressed in logic or a query language .
For example, in both the Information Manifold data integration system presented in
 and the DWQ data warehousing system , formulas of the form T(X) :- Φ(X,Y )
?This is an expanded and refined version of a research paper presented at ODBASE’05 
3For a survey, see .
are used to connect a relational data source to a CM expressed in terms of a Descrip-
tion Logic, where T(X) is a single predicate representing a table in the relational data
source, and Φ(X,Y ) is a conjunctive formula over the predicates representing the con-
cepts and relationships in the CM. In the literature, such a formalism is called local-as-
view (LAV), in contrast to global-as-view (GAV), where atomic ontology concepts and
properties are specified by queries over the database .
In all previous work it has been assumed that humans specify the mapping formulas
– a difficult, time-consuming and error-prone task, especially since the specifier must
be familiar with both the semantics of the database schema and the contents of the on-
tology. As the size and complexity of ontologies increase, it becomes desirable to have
some kind of computer tool to assist people in the task. Note that the problem of seman-
tic mapping discovery is superficially similar to that of database schema mapping, how-
ever the goal of the later is finding queries/rules for integrating/translating/exchanging
the underlying data. Mapping schemas to ontologies, on the other hand, is aimed at un-
derstanding the semantics of a schema expressed in terms of a given semantic model.
This requires paying special attentions to various semantic constructs in both schema
and ontology languages.
We have proposed a tool that assists users in discovering mapping formulas be-
tween relational database schemas and ontologies in , and presented the algorithms
and the formal results. In this paper, we provide more detailed examples for explaining
the algorithms, and we also present proofs to the formal results. Moreover, we show
how to handle GAV formulas that are often useful for many practical data integra-
tion systems. The heuristics that underlie the discovery process are based on a careful
study of standard design process relating the constructs of the relational model with
those of conceptual modeling languages. In order to improve the effectiveness of our
tool, we assume some user input in addition to the database schema and the ontology.
Specifically, inspired by the Clio project , we expect the tool user to provide simple
correspondences between atomic elements used in the database schema (e.g., column
names of tables) and those in the ontology (e.g., attribute/”data type property” names
of concepts). Given the set of correspondences, the tool is expected to reason about the
database schema and the ontology, and to generate a list of candidate formulas for each
table in the relational database. Ideally, one of the formulas is the correct one — cap-
turing user intention underlying given correspondences. The claim is that, compared to
composing logical formulas representing semantic mappings, it is much easier for users
to (i) draw simple correspondences/arrows from column names of tables in the database
to datatype properties of classes in the ontology4and then (ii) evaluate proposed for-
mulas returned by the tool. The following example illustrates the input/output behavior
of the tool proposed.
Example 1.1 An ontology contains concepts (classes), attributes of concepts (datatype
properties of classes), relationships between concepts (associations), and cardinality
constraints on occurrences of the participating concepts in a relationship. Graphically,
we use the UML notations to represent the above information. Figure 1 is an enterprise
4In fact, there exist already tools used in schema matching which help perform such tasks using
linguistic, structural, and statistical information (e.g., [4,21]).
ontology containing some basic concepts and relationships. (Recall that cardinality con-
straints in UML are written at the opposite end of the association: a Department has at
least 4 Employees working for it, and an Employee works in one Department.) Suppose
s? u? p? e?r? v?i?s? i?o? n?
Employee(ssn, name, dept, proj)?
Fig.1: Relational table, Ontology, and Correspondences.
we wish to discover the semantics of a relational table Employee(ssn,name, dept,
proj) with key ssn in terms of the enterprise ontology. Suppose that by looking at col-
umn names and the ontology graph, the user draws the simple correspondences shown
as dashed arrows in Figure 1. This indicates, for example, that the ssn column corre-
sponds to the hasSsn property of the Employee concept. Using prefixes T and O to
distinguish tables in the relational schema and concepts in the ontology (both of which
will eventually be thought of as predicates), we represent the correspondences as fol-
T : Employee.ssn?O : Employee.hasSsn
T : Employee.name?O : Employee.hasName
T : Employee.dept?O : Department.hasDeptNumber
T : Employee.proj?O : Worksite.hasNumber
Given the above inputs, the tool is expected to produce a list of plausible mapping
formulas, which would hopefully include the following formula, expressing a possible
semantics for the table:
T :Employee(ssn,name,dept,proj) :-
O:Employee(x1), O:hasSsn(x1,ssn), O:hasName(x1,name), O:Department(x2),
O:works for(x1,x2), O:hasDeptNumber(x2,dept), O:Worksite(x3), O:works on(x1,x3),
Note that, as explained in , the above, admittedly confusing notation in the litera-
ture, should really be interpreted as the First Order Logic formula
(∀ssn,name,dept,proj) T :Employee(ssn,name,dept,proj) ⇒
(∃x1,x2,x3) O:Employee(x1) ∧...
because the ontology explains what is in the table (i.e., every tuple corresponds to an
employee), rather than guaranteeing that the table satisfies the closed world assumption
(i.e., for every employee there is a tuple in the table).
An intuitive (but somewhat naive) solution, inspired by early work of Quillian ,
is based on finding the shortest connections between concepts. Technically, this in-
volves (i) finding the minimum spanning tree(s) (actually Steiner trees5) connecting the
“corresponded concepts” — those that have datatype properties corresponding to ta-
ble columns, and then (ii) encoding the tree(s) into formulas. However, in some cases
the spanning/Steiner tree may not provide the desired semantics for a table because
of known relational schema design rules. For example, consider the relational table
Project(name, supervisor), where the column name is the key and corresponds to
the attribute O:Worksite.hasName, and column supervisor corresponds to the at-
tribute O:Employee.hasSsn in Figure 1. The minimum spanning tree consisting of
Worksite, Employee, and the edge works on probably does not match the semantics
of table Project because there are multiple Employees working on a Worksite ac-
cording to the ontology cardinality, yet the table allows only one to be recorded, since
supervisor is functionally dependent on name, the key. Therefore we must seek a
functional connection from Worksite to Employee, and the connection will be the
manager of the department controlling the worksite. In this paper, we use ideas of stan-
dard relational schema design from ER diagrams in order to craft heuristics that sys-
tematically uncover the connections between the constructs of relational schemas and
those of ontologies. We propose a tool to generate “reasonable” trees connecting the
set of corresponded concepts in an ontology. In contrast to the graph theoretic results
which show that there may be too many minimum spanning/Steiner trees among the
ontology nodes (for example, there are already 5 minimum spanning trees connecting
Employee, Department, and Worksite in the very simple graph in Figure 1), we ex-
pect the tool to generate only a small number of “reasonable” trees. These expectations
are born out by our experimental results, in Section 6.
As mentioned earlier, our approach is directly inspired by the Clio project [17,18],
which developed a successful tool that infers mappings from one set of relational tables
and/or XML schemas to another, given just a set of correspondences between their
respective attributes. Without going into further details at this point, we summarize the
contributions of this work:
– We identify a new version of the data mapping problem: that of inferring complex
formulas expressing the semantic mapping between relational database schemas
and ontologies from simple correspondences.
– We propose an algorithm to find “reasonable” tree connection(s) in the ontology
(key and foreign key structure), the ontology (cardinality restrictions), and standard
database schema design guidelines.
We show that if the schema was designed from a CM using techniques well-known
in the Entity Relationship literature (which provide a natural semantic mapping and
correspondences for each table), then the tool will recover essentially all and only
the appropriate semantics. This shows that our heuristics are not just shots in the
5A Steiner tree for a set M of nodes in graph G is a minimum spanning tree of M that may
contain nodes of G which are not in M.
dark: in the case when the ontology has no extraneous material, and when a table’s
scheme has not been denormalized, the algorithm will produce good results.
– To test the effectiveness and usefulness of the algorithm in practice, we imple-
mented the algorithm in a prototype tool and applied it to a variety of database
schemas and ontologies drawn from a number of domains. We ensured the schemas
and the ontologies were developed independently, and the schemas might or might
not be derived from a CM using the standard techniques. Our experience has shown
that the user effort in specifying complex mappings by using the tool issignificantly
less than that by manually writing formulas from scratch.
The rest of the paper is structured as follows. We contrast our approach with related
work in Section 2, and in Section 3 we present the technical background and notation.
Section 4 describes an intuitive progression of ideas underlying our approach, while
Section 5 provides the mapping inference algorithm. In Section 6 we report on the
prototype implementation of these ideas and experiments with the prototype. Section 7
shows how to filter out unsatisfied mapping formulas by ontology reasoning. Section 8
discusses the issues of generating GAV mapping formulas. Finally, Section 9 concludes
and discusses future work.
2 Related Work
The Clio tool [17,18] discovers formal queries describing how target schemas can
be populated with data from source schemas. To compare with it, we could view the
present work as extending Clio to the case when the source schema is a relational
database while the target is an ontology. For example, in Example 1.1, if one viewed the
ontology as a relational schema made of unary tables (such as Employee(x1)), binary
tables (such as hasSsn(x1,ssn)) and the obvious foreign key constraints from binary
lem. The desired mapping formula from Example 1.1 would not be produced for several
reasons: (i) Clio  works by taking each table and using a chase-like algorithm to re-
peatedly extend it with columns that appear as foreign keys referencing other tables.
Such “logical relations” in the source and target are then connected by queries. In this
particular case, this would lead to logical relations such as works for ?? Employee
?? Department, but none that join, through some intermediary, hasSsn(x1,ssn) and
hasDeptNumber(x2,dept), which is part of the desired formula in this case. (ii) The
fact that ssn is a key in the table T :Employee, leads us to prefer (see Section 4)
a many-to-one relationship, such as works for, over some many-to-many relation-
ship which could have been part of the ontology (e.g., O:previouslyWorkedFor);
Clio does not differentiate the two. So the work to be presented here analyzes the key
structure of the tables and the semantics of relationships (cardinality, IsA) to elimi-
nate/downgrade unreasonable options that arise in mappings to ontologies.
Other potentially relevant work includes data reverse engineering, which aims to
extract a CM,such asan ERdiagram, froma database schema. Sophisticated algorithms
and approaches to this have appeared in the literature over the years (e.g., [15,9]). The
major difference between data reverse engineering and our work is that we are given