Translating XML Web Data into Ontologies
Yuan An and John Mylopoulos
University of Toronto, Canada
{yuana,jm}@cs.toronto.edu
Abstract. Translating XML data into ontologies is the problem of finding
an instance of an ontology, given an XML document and a specification of
the relationship between the XML schema and the ontology. A previous
study [8] investigated the ad hoc approach used in XML data integration.
In this paper, we consider translating an XML web document into an
instance of an OWL-DL ontology in the Semantic Web. We use the semantic
mapping discovered by our prototype tool [1] as the relationship between
the XML schema and the ontology. In particular, we define the solution of
the translation problem and develop an algorithm for computing a canonical
solution, which enables the ontology to answer queries by using data in
the XML document.
1 Introduction
XML has become an accepted standard for publishing data on the Web. To
integrate XML data, an earlier paper [8] studied the ad hoc approach of
translating various XML documents into a central ontology instance. In this
paper, we study a generic and formal framework for translating an XML
document into an instance of an ontology. The following example illustrates
the problem. Suppose we have an XML document X:
<db>
  <student sname="Jerry">
    <takes>
      <course title="Database Theory"/>
      <course title="Combinatorial Optimization"/>
    </takes>
    <advisor pname="John"/>
  </student>
</db>
Suppose we have an ontology shown graphically in Figure 1 using UML
notation. Given a natural mapping semantically relating the XML schema to the
ontology, we would expect an instance of the ontology to contain the following
assertions: Student(t1), Course(t2), Course(t3), Professor(t4),
hasName(t1,"Jerry"), hasTitle(t2,"Database Theory"),
hasTitle(t3,"Combinatorial Optimization"), hasName(t4,"John"), takes(t1,t2),
takes(t1,t3), hasAdvisor(t1,t4), Professor(u1), Professor(u2), Course(u3),
teaches(u1,t2), teaches(u2,t3), teaches(t4,u3), where the t_i (i = 1,...,4) and
the u_j (j = 1,...,3) are anonymous individuals in the ontology.
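As a minimal sketch of how the assertions over t1,...,t4 relate to the
document X, the following Python fragment walks the XML tree and emits one
assertion per element and attribute. The hand-written traversal is only a
stand-in for the semantic mapping and is an assumption of this sketch, not the
translation algorithm of this paper; the function name translate is
hypothetical. The individuals u1, u2, u3, which follow from the ontology
constraints rather than from X, are deliberately not produced.

```python
import xml.etree.ElementTree as ET

XML_DOC = """
<db>
  <student sname="Jerry">
    <takes>
      <course title="Database Theory"/>
      <course title="Combinatorial Optimization"/>
    </takes>
    <advisor pname="John"/>
  </student>
</db>
"""


def translate(xml_text):
    """Return the assertions stated directly by the document (the t_i part)."""
    root = ET.fromstring(xml_text)
    assertions = []
    counter = 0

    def fresh():
        # Generate the anonymous individuals t1, t2, ... in document order.
        nonlocal counter
        counter += 1
        return f"t{counter}"

    for student in root.findall("student"):
        s = fresh()
        assertions += [("Student", s), ("hasName", s, student.get("sname"))]
        for course in student.findall("takes/course"):
            c = fresh()
            assertions += [("Course", c),
                           ("hasTitle", c, course.get("title")),
                           ("takes", s, c)]
        for advisor in student.findall("advisor"):
            p = fresh()
            assertions += [("Professor", p),
                           ("hasName", p, advisor.get("pname")),
                           ("hasAdvisor", s, p)]
    return assertions


facts = translate(XML_DOC)
for fact in facts:
    print(fact)
```

Because individuals are numbered in document order, the student becomes t1,
the two courses t2 and t3, and the advisor t4, matching the naming used above.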
Note that the difference between the t_i and the u_j is that the original XML
document provides no information about the individuals u1, u2, and u3; they
were deduced from the ontology constraints. However, if we replace u1 and u2
by t4, then the resulting instance still satisfies all constraints, and it
says that professor t4 teaches both courses t2 and t3. Alternatively, we could
construct an instance in which professor t4 teaches only course t2, while
course t3 is taught by some unknown professor u2. This tells us that there can
be different instances that are consistent with the ontology and satisfy a
given mapping from the XML schema to the ontology. So, given the source
document X shown above and a query over the ontology, how can we answer the
query? If our query is, for example, "What is the name of the person who is
the advisor of the person whose name is Jerry?", the answer is John regardless
of the particular instance created for the ontology. As another example,
consider the query "What is the title of the course taught by Jerry's
advisor?" This query cannot be answered with certainty in this scenario,
because the possible instances disagree on which courses t4 teaches.
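The contrast between the two queries can be illustrated by intersecting query
answers over the three instances discussed above. This is only a sketch: true
certain answers range over all instances consistent with the ontology and the
mapping, whereas the list INSTANCES below is a finite, hand-picked stand-in,
and all function names are hypothetical.

```python
# Assertions that every instance shares (taken from the running example).
BASE = {
    ("Student", "t1"), ("Course", "t2"), ("Course", "t3"), ("Professor", "t4"),
    ("hasName", "t1", "Jerry"), ("hasName", "t4", "John"),
    ("hasTitle", "t2", "Database Theory"),
    ("hasTitle", "t3", "Combinatorial Optimization"),
    ("takes", "t1", "t2"), ("takes", "t1", "t3"), ("hasAdvisor", "t1", "t4"),
}

INSTANCES = [
    # Instance with the unknown individuals u1, u2, u3.
    BASE | {("Professor", "u1"), ("Professor", "u2"), ("Course", "u3"),
            ("teaches", "u1", "t2"), ("teaches", "u2", "t3"),
            ("teaches", "t4", "u3")},
    # u1 and u2 replaced by t4.
    BASE | {("Course", "u3"), ("teaches", "t4", "t2"),
            ("teaches", "t4", "t3"), ("teaches", "t4", "u3")},
    # t4 teaches only t2; t3 is taught by an unknown professor u2.
    BASE | {("Professor", "u2"), ("teaches", "t4", "t2"),
            ("teaches", "u2", "t3")},
]


def pairs(instance, role):
    """All (subject, object) pairs asserted for a role."""
    return {(f[1], f[2]) for f in instance if f[0] == role}


def advisors_of(instance, student_name):
    students = {x for x, n in pairs(instance, "hasName") if n == student_name}
    return {p for s, p in pairs(instance, "hasAdvisor") if s in students}


def advisor_name(instance):
    """Name of the advisor of the person named Jerry."""
    profs = advisors_of(instance, "Jerry")
    return {n for x, n in pairs(instance, "hasName") if x in profs}


def advisor_course_title(instance):
    """Title of a course taught by Jerry's advisor."""
    profs = advisors_of(instance, "Jerry")
    courses = {c for p, c in pairs(instance, "teaches") if p in profs}
    return {t for c, t in pairs(instance, "hasTitle") if c in courses}


def certain_answers(query, instances):
    """Answers returned by the query in every given instance."""
    return set.intersection(*(query(i) for i in instances))


print(certain_answers(advisor_name, INSTANCES))          # {'John'}
print(certain_answers(advisor_course_title, INSTANCES))  # set()
```

The first query yields John in every instance, while the second yields a
different set of titles in each instance, so no title survives the
intersection.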
Ontologies play a central role in the Semantic Web. Recently, the W3C has
recommended the OWL web ontology language for describing ontologies in the
Semantic Web. If an XML document needs to be translated into an OWL ontology,
the resulting ontology should preserve the information in the XML document
and be able to answer queries by using this information. Consequently, a
translation involves specifying a mapping, checking consistency, and
preserving information. In this paper, we consider the OWL-DL ontology
language because of its close relationship with Description Logics. As a
result, the OWL-DL ontology language enables us to develop a translation
algorithm. The overall framework is generic in the sense that the theoretical
issues apply to many translation problems between databases and ontologies.
The rest of the paper is organized as follows. Section 2 presents the formal
specifications of OWL-DL ontologies and XML. Section 3 defines the problem,
and Section 4 defines the canonical solution. Section 5 develops the
algorithm, and finally, Section 6 gives the conclusions.
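[Figure 1: UML class diagram. Classes: Person, Student, and Professor (each
with attribute hasName) and Course (attribute hasTitle). Associations:
hasAdvisor, takes, teaches. Multiplicity labels appearing in the diagram:
1..*, 1..1, 0..*.]
Fig. 1. An Ontology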
2 Preliminaries
We assume readers are familiar with the standard notation and semantics of
Description Logics, though we summarize here one flavor relating to the OWL-DL
web ontology language. OWL-DL is closely related to the SHOIN(D) description
logic [5], and the meanings of its terminology can be found in [4,5].
A datatype theory D is a mapping from a set of datatypes to a set of values.
The datatype (or concrete) domain, written ∆^I_D, is the union of the mappings
of the datatypes. Let R be a set of role names consisting of a set of abstract
role names R_A and a set of concrete role names R_D. The set of SHOIN-roles (or
roles for short) consists of a set of abstract roles R_A ∪ {R⁻ | R ∈ R_A} and a
set of concrete roles R_D. An RBox R consists of a finite set of transitivity
axioms Trans(R) and role inclusion axioms of the form R ⊑ S and T ⊑ U, where R
and S are abstract roles, and T and U are concrete roles. ⊑* denotes the
reflexive-transitive closure of ⊑ on roles, i.e., for two abstract roles R and
S, S ⊑* R ∈ R if S and R are the same, S ⊑ R ∈ R, Inv(S) ⊑ Inv(R) ∈ R, or there
exists some role Q such that S ⊑* Q ∈ R and Q ⊑* R ∈ R. A role having no
transitive sub-roles is called a simple role, and Inv(R) = R⁻.
The set of SHOIN(D) concepts is defined by the following syntactic rules,
where the C_i are concepts, A is an atomic concept, R is an abstract role, S is
an abstract simple role, T is a concrete role, the o_i are individuals, D is a
datatype, and n is a non-negative integer:
C → A | ¬C | C_1 ⊓ C_2 | C_1 ⊔ C_2 | ∃R.C | ∀R.C | ≥ n S | ≤ n S |
    {o_1,...,o_n} | ≥ n T | ≤ n T | ∃T.D | ∀T.D
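The syntactic rules can be mirrored by a small data type. The sketch below
covers only the abstract constructors (no concrete roles, datatypes, or
inverse roles) and computes the extension of a concept in a finite
interpretation using the standard description logic semantics. The class names
and the function extension are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Atom:          # atomic concept A
    name: str

@dataclass(frozen=True)
class Not:           # ¬C
    arg: object

@dataclass(frozen=True)
class And:           # C_1 ⊓ C_2
    left: object
    right: object

@dataclass(frozen=True)
class Or:            # C_1 ⊔ C_2
    left: object
    right: object

@dataclass(frozen=True)
class Exists:        # ∃R.C
    role: str
    arg: object

@dataclass(frozen=True)
class Forall:        # ∀R.C
    role: str
    arg: object

@dataclass(frozen=True)
class AtLeast:       # ≥ n S
    n: int
    role: str

@dataclass(frozen=True)
class AtMost:        # ≤ n S
    n: int
    role: str

@dataclass(frozen=True)
class OneOf:         # {o_1,...,o_n}
    individuals: frozenset


def extension(c, domain, concepts, roles):
    """Set of domain elements that belong to concept c."""
    def succ(x, role):
        return {b for a, b in roles.get(role, set()) if a == x}

    def ext(d):
        return extension(d, domain, concepts, roles)

    if isinstance(c, Atom):
        return set(concepts.get(c.name, set()))
    if isinstance(c, Not):
        return domain - ext(c.arg)
    if isinstance(c, And):
        return ext(c.left) & ext(c.right)
    if isinstance(c, Or):
        return ext(c.left) | ext(c.right)
    if isinstance(c, Exists):
        inner = ext(c.arg)
        return {x for x in domain if succ(x, c.role) & inner}
    if isinstance(c, Forall):
        inner = ext(c.arg)
        return {x for x in domain if succ(x, c.role) <= inner}
    if isinstance(c, AtLeast):
        return {x for x in domain if len(succ(x, c.role)) >= c.n}
    if isinstance(c, AtMost):
        return {x for x in domain if len(succ(x, c.role)) <= c.n}
    if isinstance(c, OneOf):
        return set(c.individuals) & domain
    raise TypeError(f"unsupported concept: {c!r}")


DOMAIN = {"t1", "t2", "t3", "t4"}
CONCEPTS = {"Student": {"t1"}, "Course": {"t2", "t3"}, "Professor": {"t4"}}
ROLES = {"takes": {("t1", "t2"), ("t1", "t3")}, "hasAdvisor": {("t1", "t4")}}

busy_student = And(Atom("Student"), AtLeast(2, "takes"))
print(extension(busy_student, DOMAIN, CONCEPTS, ROLES))  # {'t1'}
```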
A TBox T consists of a finite set of concept inclusion axioms C_1 ⊑ C_2; an
ABox A consists of a finite set of concept assertions, role assertions, and
individual (in)equalities of the forms C(a), R(a,b), a = b, and a ≠ b,
respectively. A SHOIN(D) knowledge base (an ontology) O = (T,R,A) consists of
a TBox T, an RBox R, and an ABox A. The semantics of SHOIN(D) is given by means
of an interpretation I = (∆^I, ·^I) consisting of a non-empty domain ∆^I,
disjoint from the datatype domain ∆^I_D, and a mapping ·^I, which interprets
atomic and complex concepts, roles, axioms, and assertions in the standard
description logic way. An interpretation I is a model of the knowledge base
O = (T,R,A) if I satisfies every concept, axiom, and assertion in O. From the
database perspective, the TBox T and the RBox R can be viewed as a schema with
unary and binary relational tables, and the ABox A can be viewed as an
instance. An ABox A is consistent with respect to O if there is a model of
(T,R,A) (we then say O is consistent). A concept or role assertion β is a
logical consequence of an ABox A (written A |= β) if β is true in every model
of A w.r.t. (T,R). We write O |= β when β is a logical consequence of the
ontology.
An XML document is typically modeled as a node-labeled tree. For our purposes,
we assume that each XML document is described by an XML schema consisting of a
set of element and attribute type definitions. Specifically, we assume the
following countably infinite disjoint sets: Ele of element names, Att of
attribute names, and Dom of simple type names including the built-in XML
schema datatypes. Attribute names are preceded by "@" to distinguish them from
element names. Given finite sets E ⊂ Ele and A ⊂ Att, an XML schema
S = (E,A,τ,ρ,κ) specifies the type of each element ℓ in E, the attributes that
ℓ has, and the datatype of each attribute in A. Specifically, an element type
τ is defined by the grammar
τ ::= ε | Sequence[ℓ_1: τ_1,...,ℓ_n: τ_n] | Choice[ℓ_1: τ_1,...,ℓ_n: τ_n]
(ℓ_1,...,ℓ_n ∈ E), where ε is the empty type, and Sequence and Choice are
complex types. Each element is associated with an occurrence constraint given
by two values: minOccurs, the minimum number of occurrences, and maxOccurs,
the maximum number of occurrences. The set of attributes of an element ℓ ∈ E
is defined by the function ρ : E → 2^A, and the function κ : A → Dom specifies
the datatypes of the attributes in A. Each datatype name is associated with a
set of values in a domain Dom. In this paper, we do not consider simple type
elements (corresponding to DTD's PCDATA), assuming instead that they have been
represented using attributes. Furthermore, a special element r ∈ E is the root
of the XML schema such that ρ(r) = ∅, and we assume that for any two distinct
elements ℓ_i, ℓ_j ∈ E, ρ(ℓ_i) ∩ ρ(ℓ_j) = ∅.
An XML document X = (N,<,r,λ,η) over (E,A) consists of a set of nodes N, a
child relation < between nodes, a root node r, and two functions:
– a labeling function λ: N → E ∪ A such that if λ(v) = ℓ ∈ E, we say that v is
of the element type ℓ; if λ(v) = @a ∈ A, we say that v is an attribute @a;
– a partial function η: N → Dom, defined for every node v with λ(v) = @a ∈ A,
which assigns to v a value from the domain Dom that supplies values to the
simple type names in Dom.
An XML document X = (N,<,r,λ,η) conforms to a schema S = (E,A,τ,ρ,κ),
denoted by X |= S, if:
1. for every node v in X with children v_1,...,v_m such that λ(v_i) ∈ E for
i = 1,...,m, if λ(v) = ℓ, then λ(v_1),...,λ(v_m) satisfies τ(ℓ) and the
occurrence constraints;
2. for every node v in X with children u_1,...,u_n such that λ(u_i) = @a_i ∈ A
for i = 1,...,n, if λ(v) = ℓ, then λ(u_i) = @a_i ∈ ρ(ℓ), and η(u_i) is a value
having datatype κ(@a_i).
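The two conformance conditions can be sketched as a recursive check. The
schema below for the running example is an assumption of this sketch (the
paper does not list it), as are the simplifications that a Sequence requires
its children to appear in declared order as contiguous runs, that a Choice
admits exactly one alternative, and that datatypes are modeled by Python
constructors. All names (TAU, OCCURS, RHO, KAPPA, conforms) are hypothetical.

```python
import xml.etree.ElementTree as ET
from itertools import groupby

# Hypothetical encoding of S = (E, A, tau, rho, kappa).
TAU = {
    "db": ("Sequence", ["student"]),
    "student": ("Sequence", ["takes", "advisor"]),
    "takes": ("Sequence", ["course"]),
    "course": ("Empty", []),
    "advisor": ("Empty", []),
}
# (minOccurs, maxOccurs) per element; None stands for "unbounded".
OCCURS = {"db": (1, 1), "student": (0, None), "takes": (1, 1),
          "course": (1, None), "advisor": (1, 1)}
# rho: attributes per element (ElementTree stores names without "@").
RHO = {"db": set(), "student": {"sname"}, "takes": set(),
       "course": {"title"}, "advisor": {"pname"}}
# kappa: datatype per attribute, modeled as a Python constructor.
KAPPA = {"sname": str, "title": str, "pname": str}


def in_range(count, label):
    lo, hi = OCCURS[label]
    return count >= lo and (hi is None or count <= hi)


def children_ok(node):
    """Condition 1: child labels satisfy tau and the occurrence constraints."""
    kind, names = TAU[node.tag]
    runs = [(label, len(list(group)))
            for label, group in groupby(child.tag for child in node)]
    counts = dict(runs)
    if len(counts) != len(runs) or any(label not in names for label in counts):
        return False  # a label re-appears after another one, or is not allowed
    if kind == "Choice":
        return len(runs) == 1 and in_range(runs[0][1], runs[0][0])
    # Sequence or Empty: runs must follow the declared order.
    if [label for label, _ in runs] != [n for n in names if n in counts]:
        return False
    return all(in_range(counts.get(n, 0), n) for n in names)


def attributes_ok(node):
    """Condition 2: attributes belong to rho and have the datatype kappa."""
    for name, value in node.attrib.items():
        if name not in RHO[node.tag]:
            return False
        try:
            KAPPA[name](value)
        except ValueError:
            return False
    return True


def conforms(node):
    if node.tag not in TAU:
        return False
    return (children_ok(node) and attributes_ok(node)
            and all(conforms(child) for child in node))


GOOD = ('<db><student sname="Jerry"><takes><course title="Database Theory"/>'
        '</takes><advisor pname="John"/></student></db>')
BAD = ('<db><student sname="Jerry"><takes><course title="Database Theory"/>'
       '</takes></student></db>')  # advisor is missing

print(conforms(ET.fromstring(GOOD)))  # True
print(conforms(ET.fromstring(BAD)))   # False
```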
Now we turn to the mapping language relating a pattern in an XML schema to
a formula in an ontology. On the XML side, the basic components are attribute
formulas [2], which are specified by the syntax
α ::= ℓ | ℓ(@a_1 = x_1,...,@a_n = x_n), where ℓ ∈ E and @a_1,...,@a_n ∈ A (E and
A being the sets of element names and attribute names, respectively), and the
variables x_1,...,x_n are the free variables of α. Tree-pattern formulas over
an XML schema S = (E,A,τ,ρ,κ) are defined by ψ ::= α | α[ϕ_1,...,ϕ_n], where α
ranges over attribute formulas over (E,A) and ϕ_1,...,ϕ_n are tree-pattern
formulas. The free variables of a tree-pattern formula ψ are the free variables
of all the attribute formulas that occur in it. For example,
Company[Department[employee(@eid = x1)[manager(@mid = x2)[employee(@eid = x3)]]]]
is a tree-pattern formula.
An attribute formula is evaluated at a node of an XML document, and values
for free variables come from the domain Dom. If X is an XML document over
(E,A) and v is a node of X, then
– (X,v) |= ℓ iff λ(v) = ℓ, for ℓ ∈ E.
– if α(x_1,...,x_n) = ℓ(@a_1 = x_1,...,@a_n = x_n), then (X,v) |= α(s_1,...,s_n),
where s_1,...,s_n ∈ Dom, iff λ(v) = ℓ and, for each child v_i of v such that
λ(v_i) = @a_i, η(v_i) = s_i for i ∈ [1,...,n].
Given a document X, a tree-pattern formula ψ(x), and a tuple s from Dom,
ψ(s) is satisfied in X (written X |= ψ(s)) if there is a witness node v for
ψ(s). Formally, a witness node for ψ(s) is defined as follows:
– v is a witness node for α(s), where α is an attribute formula, iff
(X,v) |= α(s).
– v is a witness node for α(s)[ψ_1(s_1),...,ψ_m(s_m)] iff (X,v) |= α(s) and
there are m children v_1,...,v_m of v such that each v_i is a witness node for
ψ_i(s_i), for i = 1,...,m.
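The witness-node definition can be sketched as a recursive matcher that, rather
than checking one given tuple s, enumerates every binding of the free variables
for which some node is a witness. The tuple encoding of patterns and the
function names are hypothetical. As a simplifying assumption, the sketch
requires each attribute named in a pattern to be present on the node, and it
does not require the children v_1,...,v_m to be distinct.

```python
import xml.etree.ElementTree as ET

XML_DOC = """
<db>
  <student sname="Jerry">
    <takes>
      <course title="Database Theory"/>
      <course title="Combinatorial Optimization"/>
    </takes>
    <advisor pname="John"/>
  </student>
</db>
"""

# A pattern is (element label, {attribute: variable}, [sub-patterns]).
# This one encodes
#   student(@sname = x1)[takes[course(@title = x2)], advisor(@pname = x3)].
PATTERN = ("student", {"sname": "x1"}, [
    ("takes", {}, [("course", {"title": "x2"}, [])]),
    ("advisor", {"pname": "x3"}, []),
])


def merge(b1, b2):
    """Combine two bindings, or return None if they disagree on a variable."""
    for var, val in b2.items():
        if b1.get(var, val) != val:
            return None
    return {**b1, **b2}


def bindings_at(node, pattern):
    """Yield every binding under which `node` is a witness node for `pattern`."""
    label, attrs, subpatterns = pattern
    if node.tag != label or any(a not in node.attrib for a in attrs):
        return
    base = {}
    for attr, var in attrs.items():
        base = merge(base, {var: node.get(attr)})
        if base is None:
            return
    results = [base]
    for sub in subpatterns:
        # Each sub-pattern needs some child of `node` as its own witness.
        results = [m for b in results for child in node
                   for cb in bindings_at(child, sub)
                   if (m := merge(b, cb)) is not None]
    yield from results


def satisfying_bindings(root, pattern):
    """All bindings s with X |= pattern(s), trying every node as a witness."""
    found = []
    for node in root.iter():
        for b in bindings_at(node, pattern):
            if b not in found:
                found.append(b)
    return found


result = satisfying_bindings(ET.fromstring(XML_DOC), PATTERN)
for binding in result:
    print(binding)
```

On the document X, the student node is a witness for two bindings, one per
course, both with x1 bound to Jerry and x3 bound to John.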
On the ontology side, we use conjunctive formulas with annotations, which treat
atomic concepts and roles as unary and binary predicates, respectively. For ex-
ample, given an ontology containing the atomic concept Employee and roles
hasId, hasManager, and manages, the following is a mapping formula,
Company[Department[
employee(@eid = x1)[
manager (@mid = x2) [
employee (@eid=x3) ]]]] →
Employee(Y1), hasId(Y1,x1), Employee(Y2), hasId(Y2,x2),
hasManager(Y1,x2), Employee(Y3), hasId(Y3,x3), manages(Y2,Y3)::
identif(Y1,x1), identif(Y2,x2), identif(Y3,x3).
There are two sorts of variables. Variables of one sort, denoted, e.g., by
Y_i, represent individuals in the ontology; variables of the other sort,
denoted, e.g., by x_j, represent data values, comprising the attribute values
in the XML document and concrete values in the ontology. Since attribute
values in the XML document come from the domain Dom, while concrete values in
the ontology come from the domain ∆^I_D, we assume that each mapping formula
implies a set of conversion functions such that, when a single variable name
x_j is used on both sides, the datatypes in the corresponding positions are
matched through an implicit conversion function. We denote by ConstValue the
set of all data values that occur in the XML document, and we also call them
constant values. In addition, we assume an infinite set VarValue, whose
members we call variable values, comprising an infinite set Individual of
individuals and an infinite set DataValue of data values. We require that
ConstValue ∩ VarValue = ∅.
The annotation comes after :: in the mapping formula. Each predicate in
the annotation is of the form identif(Y,Z), in which Y is an individual
variable and Z is a tuple of variables. The meaning of identif(Y,Z) is as
follows. The information in the XML document indicates that an individual
belonging to the concept C for which Y is the placeholder variable, i.e., C(Y)
appears in the formula, can be identified by a set of roles P_1,...,P_n in the
ontology, where P_1,...,P_n bind Y to Z in the formula, i.e.,
P_1(Y,Z_1),...,P_n(Y,Z_n) appear. We will see later that the annotation is
important during the translation and for consistency checking in the ontology.
To specify the mapping formulas, we proposed a semi-automatic tool, MAPONTO,
in [1].
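One possible reading of the annotation, offered only as an illustrative
sketch and not as the translation algorithm developed later in this paper, is
that two occurrences of an individual variable denote the same individual
whenever their concept and identifying values coincide. The fragment below
applies the right-hand side of the Employee mapping formula to a list of
variable bindings under that assumption; the bindings, the individual names
ind1, ind2, ..., and the function apply_mapping are hypothetical.

```python
# Right-hand side of the example mapping formula, copied atom by atom.
ATOMS = [
    ("Employee", "Y1"), ("hasId", "Y1", "x1"),
    ("Employee", "Y2"), ("hasId", "Y2", "x2"),
    ("hasManager", "Y1", "x2"),
    ("Employee", "Y3"), ("hasId", "Y3", "x3"),
    ("manages", "Y2", "Y3"),
]
# Annotation: identif(Y1,x1), identif(Y2,x2), identif(Y3,x3).
IDENTIF = {"Y1": ("x1",), "Y2": ("x2",), "Y3": ("x3",)}
# Concept for which each individual variable is the placeholder.
CONCEPT_OF = {args[0]: pred for pred, *args in ATOMS if len(args) == 1}


def apply_mapping(bindings_list):
    """Instantiate ATOMS once per binding, reusing identified individuals."""
    individuals = {}  # (concept, identifying values) -> individual name
    abox = set()
    for binding in bindings_list:
        subst = dict(binding)
        for y, key_vars in IDENTIF.items():
            key = (CONCEPT_OF[y], tuple(binding[v] for v in key_vars))
            # Assumption: equal concept and key values mean the same individual.
            subst[y] = individuals.setdefault(key, f"ind{len(individuals) + 1}")
        for pred, *args in ATOMS:
            abox.add((pred, *(subst[a] for a in args)))
    return abox


# Hypothetical bindings of (x1, x2, x3) obtained from the tree-pattern side.
BINDINGS = [
    {"x1": "e1", "x2": "e2", "x3": "e3"},
    {"x1": "e2", "x2": "e4", "x3": "e1"},
]

abox = apply_mapping(BINDINGS)
for assertion in sorted(abox):
    print(assertion)
```

With these bindings, the employee with identifier e2 is created once and then
reused when it reappears in the second binding, so only four Employee
individuals are generated instead of six.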
3 The Problem of Translating XML Data into Ontologies
We now define the problem of translating XML into ontologies (X-to-O).