
Translating XML Web Data into Ontologies

Yuan An and John Mylopoulos

University of Toronto, Canada

{yuana,jm}@cs.toronto.edu

Abstract. Translating XML data into ontologies is the problem of finding an
instance of an ontology, given an XML document and a specification of the
relationship between the XML schema and the ontology. A previous study [8]
investigated the ad hoc approach used in XML data integration. In this paper,
we consider translating an XML web document into an instance of an OWL-DL
ontology in the Semantic Web. We use the semantic mapping discovered by our
prototype tool [1] to specify the relationship between the XML schema and the
ontology. In particular, we define what a solution to the translation problem
is and develop an algorithm for computing a canonical solution, which enables
the ontology to answer queries using data in the XML document.

1 Introduction

XML has become an accepted standard for publishing data on the Web. To
integrate XML data, an earlier paper [8] studied an ad hoc approach to
translating various XML documents into a central ontology instance. In this
paper, we study a generic and formal framework for translating an XML document
into an instance of an ontology. The following example illustrates the
problem. Suppose we have an XML document X:

<db>
  <student sname="Jerry">
    <takes>
      <course title="Database Theory"/>
      <course title="Combinatorial Optimization"/>
    </takes>
    <advisor pname="John"/>
  </student>
</db>

Suppose we have an ontology shown graphically in Figure 1 using UML notation.
Given a natural mapping semantically relating the XML schema to the ontology,
we would expect an instance of the ontology to contain the following
assertions: Student(t1), Course(t2), Course(t3), Professor(t4),
hasName(t1, "Jerry"), hasTitle(t2, "Database Theory"),
hasTitle(t3, "Combinatorial Optimization"), hasName(t4, "John"), takes(t1, t2),
takes(t1, t3), hasAdvisor(t1, t4), Professor(u1), Professor(u2), Course(u3),
teaches(u1, t2), teaches(u2, t3), teaches(t4, u3), where t_i (i = 1, ..., 4)
and u_j (j = 1, ..., 3) are anonymous individuals in the ontology.
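The assertions that come directly from the document data (those over t1 to t4) can be reproduced with a small script. This is only a minimal sketch: the hard-coded correspondence between XML paths and ontology terms stands in for the "natural mapping" and is an assumption, as are the function name `translate` and the tuple encoding of assertions. The assertions over u1 to u3 are not produced, since they follow from the ontology constraints rather than from the document.

```python
import xml.etree.ElementTree as ET

DOC = """
<db>
  <student sname="Jerry">
    <takes>
      <course title="Database Theory"/>
      <course title="Combinatorial Optimization"/>
    </takes>
    <advisor pname="John"/>
  </student>
</db>
"""


def translate(xml_text):
    """Return the assertions that follow directly from the document data."""
    facts = []
    counter = 0

    def fresh():
        # Create a new anonymous individual t1, t2, ...
        nonlocal counter
        counter += 1
        return f"t{counter}"

    for student in ET.fromstring(xml_text).iter("student"):
        s = fresh()
        facts += [("Student", s), ("hasName", s, student.get("sname"))]
        for course in student.findall("takes/course"):
            c = fresh()
            facts += [("Course", c), ("hasTitle", c, course.get("title")),
                      ("takes", s, c)]
        for advisor in student.findall("advisor"):
            p = fresh()
            facts += [("Professor", p), ("hasName", p, advisor.get("pname")),
                      ("hasAdvisor", s, p)]
    return facts
```

Calling `translate(DOC)` yields the eleven document-derived assertions, with the individuals numbered in the same order as in the example above.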


Note that the difference between the t_i and the u_j is that the original XML
document provides no information about the individuals u1, u2, and u3; they
were deduced from the ontology constraints. However, if we replace u1 and u2
by t4, the resulting instance still satisfies all constraints, and it says
that professor t4 teaches both courses t2 and t3. Alternatively, we could
construct an instance in which professor t4 teaches only course t2, while
course t3 is taught by some unknown professor u2. Hence there can be different
instances that are consistent with the ontology and satisfy a given mapping
from the XML schema to the ontology. So, given the source document X shown
above and a query over the ontology, how can we answer the query? If our query
is, for example, "What is the name of the person who is the advisor of the
person whose name is Jerry?", the answer is John regardless of which
particular instance was created for the ontology. As another example, consider
the query "What is the title of the course taught by Jerry's advisor?" This
query cannot be answered with certainty in this scenario.
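One way to make "answered with certainty" concrete is to keep only the answers returned by every candidate instance. The sketch below does this for the three instances discussed above. The tuple encoding of assertions and the function names are assumptions, and three hand-picked instances are of course not the set of all possible instances, so the code illustrates the idea rather than a complete procedure.

```python
# Binary assertions shared by all three instances discussed in the text.
BASE = {
    ("hasName", "t1", "Jerry"), ("hasName", "t4", "John"),
    ("hasTitle", "t2", "Database Theory"),
    ("hasTitle", "t3", "Combinatorial Optimization"),
    ("hasAdvisor", "t1", "t4"),
}

# The instances differ only in who teaches what.
INSTANCES = [
    BASE | {("teaches", "u1", "t2"), ("teaches", "u2", "t3"), ("teaches", "t4", "u3")},
    BASE | {("teaches", "t4", "t2"), ("teaches", "t4", "t3"), ("teaches", "t4", "u3")},
    BASE | {("teaches", "t4", "t2"), ("teaches", "u2", "t3")},
]


def advisors(inst, name):
    """Individuals that advise someone whose name is `name`."""
    students = {s for (p, s, v) in inst if p == "hasName" and v == name}
    return {a for (p, s, a) in inst if p == "hasAdvisor" and s in students}


def advisor_names(inst, name):
    advs = advisors(inst, name)
    return {v for (p, s, v) in inst if p == "hasName" and s in advs}


def advisor_course_titles(inst, name):
    advs = advisors(inst, name)
    courses = {c for (p, t, c) in inst if p == "teaches" and t in advs}
    return {v for (p, c, v) in inst if p == "hasTitle" and c in courses}


def certain(query, instances, *args):
    """Answers returned by the query on every one of the given instances."""
    return set.intersection(*(query(inst, *args) for inst in instances))
```

Under these assumptions, `certain(advisor_names, INSTANCES, "Jerry")` contains only John, while `certain(advisor_course_titles, INSTANCES, "Jerry")` is empty, because the first instance knows no title for the course u3 taught by t4.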

Ontologies play a central role in the Semantic Web. Recently, the W3C has
recommended the OWL web ontology language for describing ontologies in the
Semantic Web. If an XML document needs to be translated into an OWL ontology,
the resulting ontology should preserve the information in the XML document and
be able to answer queries using this information. Consequently, a translation
involves specifying a mapping, checking consistency, and preserving
information. In this paper, we consider the OWL-DL ontology language because
of its close relationship with Description Logics, which enables us to develop
a translation algorithm. The overall framework is generic in the sense that
the theoretical issues apply to many translation problems between databases
and ontologies.

The rest of the paper is organized as follows. Section 2 presents the formal
specifications of OWL-DL ontologies and XML. Section 3 defines the problem,
and Section 4 defines the canonical solution. Section 5 develops the
algorithm, and finally, Section 6 gives the conclusions.

[Figure: UML class diagram with classes Person, Student, and Professor (each
with attribute hasName) and Course (with attribute hasTitle); associations
hasAdvisor, takes, and teaches, annotated with multiplicities 1..*, 1..1,
and 0..*.]

Fig. 1. An Ontology

2 Preliminaries

We assume readers are familiar with the standard notations and semantics of
Description Logics, though we summarize here one flavor relating to the OWL-DL
web ontology language. OWL-DL is closely related to the SHOIN(D) description
logic [5], and the meanings of its terminology can be found in [4,5].

A datatype theory D is a mapping from a set of datatypes to a set of values.
The datatype (or concrete) domain, written ∆^I_D, is the union of the mappings
of the datatypes. Let R be a set of role names consisting of a set of abstract
role names R_A and a set of concrete role names R_D. The set of SHOIN-roles
(or roles for short) consists of a set of abstract roles R_A ∪ {R⁻ | R ∈ R_A}
and a set of concrete roles R_D. An RBox R consists of a finite set of
transitivity axioms Trans(R), and role inclusion axioms of the form R ⊑ S and
T ⊑ U, where R and S are abstract roles, and T and U are concrete roles. ⊑*
denotes the reflexive-transitive closure of ⊑ on roles, i.e., for two abstract
roles R, S, S ⊑* R ∈ R if S and R are the same, S ⊑ R ∈ R,
Inv(S) ⊑ Inv(R) ∈ R, or there exists some role Q such that S ⊑* Q ∈ R and
Q ⊑* R ∈ R. A role not having transitive sub-roles is called a simple role,
and Inv(R) = R⁻.
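The definition of ⊑* can be read as a fixpoint computation over the role inclusion axioms. The following is a minimal sketch under our own encoding assumptions: roles are strings, the inverse of a role is written with a trailing "-", and the function names `inv` and `sub_role_closure` as well as the example roles in the usage note are hypothetical.

```python
def inv(role):
    """Inverse of a role, encoded with a trailing '-'; Inv(Inv(R)) = R."""
    return role[:-1] if role.endswith("-") else role + "-"


def sub_role_closure(inclusions):
    """Return all pairs (S, R) with S ⊑* R, given axioms as (sub, super) pairs."""
    roles = {r for pair in inclusions for r in pair}
    roles |= {inv(r) for r in roles}

    closure = {(r, r) for r in roles}                      # S and R are the same
    closure |= set(inclusions)                             # S ⊑ R is an axiom
    closure |= {(inv(s), inv(r)) for s, r in inclusions}   # Inv(S) ⊑ Inv(R) is an axiom

    # Transitivity: add (S, R) whenever S ⊑* Q and Q ⊑* R, until nothing changes.
    changed = True
    while changed:
        new = {(s, r) for s, q in closure for q2, r in closure if q == q2}
        changed = not new <= closure
        closure |= new
    return closure
```

For instance, with the hypothetical axioms hasAdvisor ⊑ knows and knows ⊑ relatedTo, the closure contains (hasAdvisor, relatedTo) and (hasAdvisor-, relatedTo-) but not (relatedTo, hasAdvisor).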

The set of SHOIN(D) concepts is defined by the following syntactic rules,
where the C_i are concepts, A is an atomic concept, R is an abstract role, S
is an abstract simple role, T is a concrete role, the o_i are individuals, D
is a datatype, and n is a non-negative integer:

C → A | ¬C | C_1 ⊓ C_2 | C_1 ⊔ C_2 | ∃R.C | ∀R.C | ≥ n S | ≤ n S |
    {o_1, ..., o_n} | ≥ n T | ≤ n T | ∃T.D | ∀T.D

A TBox T consists of a finite set of concept inclusion axioms C_1 ⊑ C_2; an
ABox A consists of a finite set of concept assertions C(a), role assertions
R(a, b), and individual (in)equalities a = b and a ≠ b. A SHOIN(D) knowledge
base (an ontology) O = (T, R, A) consists of a TBox T, an RBox R, and an ABox
A. The semantics of SHOIN(D) is given by means of an interpretation
I = (∆^I, ·^I) consisting of a non-empty domain ∆^I, disjoint from the
datatype domain ∆^I_D, and a mapping ·^I, which interprets atomic and complex
concepts, roles, axioms, and assertions in the standard description logic way.
An interpretation I is a model of the knowledge base O = (T, R, A) if I
satisfies every concept, axiom, and assertion in O. From the database
perspective, the TBox T and the RBox R can be viewed as a schema with unary
and binary relational tables, and the ABox A can be viewed as an instance. An
ABox A is consistent with respect to T and R if there is a model of (T, R, A)
(in which case we say that O is consistent). A concept or role assertion β is
a logical consequence of an ABox A (written A |= β) if β is true in every
model of A w.r.t. ⟨T, R⟩. We write O |= β when β is a logical consequence of
the ontology.

An XML document is typically modeled as a node-labeled tree. For our purposes,
we assume that each XML document is described by an XML schema consisting of a
set of element and attribute type definitions. Specifically, we assume the
following countably infinite disjoint sets: Ele of element names, Att of
attribute names, and Dom of simple type names including the built-in XML
Schema datatypes. Attribute names are preceded by "@" to distinguish them from
element names. Given finite sets E ⊂ Ele and A ⊂ Att, an XML schema
S = (E, A, τ, ρ, κ) specifies the type of each element ℓ in E, the attributes
that ℓ has, and the datatype of each attribute in A. Specifically, an element
type τ is defined by the grammar
τ ::= ε | Sequence[ℓ_1: τ_1, ..., ℓ_n: τ_n] | Choice[ℓ_1: τ_1, ..., ℓ_n: τ_n]
(ℓ_1, ..., ℓ_n ∈ E), where ε is the empty type, and Sequence and Choice are
complex types. Each element is associated with an occurrence constraint given
by two values: minOccurs, the minimum number of occurrences, and maxOccurs,
the maximum number of occurrences. The set of attributes of an element ℓ ∈ E
is defined by the function ρ : E → 2^A, and the function κ : A → Dom specifies
the datatypes of the attributes in A. Each datatype name is associated with a
set of values in a domain Dom. In this paper, we do not consider simple-type
elements (corresponding to DTD's PCDATA), assuming instead that they have been
represented using attributes. Furthermore, a special element r ∈ E is the root
of the XML schema such that ρ(r) = ∅, and we assume that for any two distinct
elements ℓ_i, ℓ_j ∈ E, ρ(ℓ_i) ∩ ρ(ℓ_j) = ∅.

An XML document X = (N, <, r, λ, η) over (E, A) consists of a set of nodes N,
a child relation < between nodes, a root node r, and two functions:
– a labeling function λ: N → E ∪ A such that if λ(v) = ℓ ∈ E, we say that v
  is of element type ℓ; if λ(v) = @a ∈ A, we say that v is an attribute @a;
– a partial function η: N → Dom, defined for every node v with
  λ(v) = @a ∈ A, which assigns to v a value from the domain Dom that supplies
  values for the simple type names in Dom.

An XML document X = (N, <, r, λ, η) conforms to a schema S = (E, A, τ, ρ, κ),
denoted by X |= S, if:
1. for every node v in X with children v_1, ..., v_m such that λ(v_i) ∈ E for
   i = 1, ..., m, if λ(v) = ℓ, then λ(v_1), ..., λ(v_m) satisfies τ(ℓ) and the
   occurrence constraints;
2. for every node v in X with children u_1, ..., u_n such that
   λ(u_i) = @a_i ∈ A for i = 1, ..., n, if λ(v) = ℓ, then λ(u_i) = @a_i ∈ ρ(ℓ),
   and η(u_i) is a value having datatype κ(@a_i).
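The two conformance conditions can be checked by a recursive walk over the document tree. The sketch below is illustrative only and makes several assumptions: the schema for the student document of Section 1 is our own guess, τ is flattened into a table from element names to a kind and an ordered list of child element names, attributes are read from ElementTree's attribute dictionary rather than modeled as child nodes, and the dictionary and function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical schema for the student document of Section 1.
TAU = {                       # element -> (kind, child element names in order)
    "db": ("Sequence", ["student"]),
    "student": ("Sequence", ["takes", "advisor"]),
    "takes": ("Sequence", ["course"]),
    "course": ("Empty", []),
    "advisor": ("Empty", []),
}
OCCURS = {                    # element -> (minOccurs, maxOccurs); None = unbounded
    "db": (1, 1), "student": (0, None), "takes": (1, 1),
    "course": (1, None), "advisor": (1, 1),
}
RHO = {"db": set(), "student": {"sname"}, "takes": set(),
       "course": {"title"}, "advisor": {"pname"}}
KAPPA = {"sname": "string", "title": "string", "pname": "string"}
HAS_DATATYPE = {"string": lambda value: True}   # lexical check per datatype


def in_bounds(label, count):
    low, high = OCCURS[label]
    return low <= count and (high is None or count <= high)


def children_ok(labels, kind, parts):
    """Condition 1: the child element labels satisfy the element type."""
    if kind == "Empty":
        return not labels
    if kind == "Choice":
        return any(all(l == p for l in labels) and in_bounds(p, len(labels))
                   for p in parts)
    pos = 0                   # Sequence: consume each part greedily, in order
    for part in parts:
        start = pos
        while pos < len(labels) and labels[pos] == part:
            pos += 1
        if not in_bounds(part, pos - start):
            return False
    return pos == len(labels)


def attributes_ok(node):
    """Condition 2: attributes are allowed and their values are well typed."""
    return all(a in RHO[node.tag] and HAS_DATATYPE[KAPPA[a]](v)
               for a, v in node.attrib.items())


def conforms(node):
    if node.tag not in TAU:
        return False
    kind, parts = TAU[node.tag]
    return (children_ok([c.tag for c in node], kind, parts)
            and attributes_ok(node)
            and all(conforms(c) for c in node))
```

The greedy treatment of Sequence is enough here because adjacent parts have different names; a schema that repeats an element name within one Sequence would need a more careful matcher.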

Now we turn to the mapping language relating a pattern in an XML schema to a
formula in an ontology. On the XML side, the basic components are attribute
formulas [2], which are specified by the syntax
α ::= ℓ | ℓ(@a_1 = x_1, ..., @a_n = x_n), where ℓ ∈ E and @a_1, ..., @a_n ∈ A,
with E and A the sets of element names and attribute names, respectively; the
variables x_1, ..., x_n are the free variables of α. Tree-pattern formulas
over an XML schema S = (E, A, τ, ρ, κ) are defined by
ψ ::= α | α[ψ_1, ..., ψ_n], where α ranges over attribute formulas over
(E, A). The free variables of a tree-pattern formula ψ are the free variables
of all the attribute formulas that occur in it. For example,
Company[Department[employee(@eid = x_1)[manager(@mid = x_2)[employee(@eid = x_3)]]]]
is a tree-pattern formula.

An attribute formula is evaluated in a node of an XML document, and values for
free variables come from the domain Dom. If X is an XML document over (E, A)
and v is a node of X, then
– (X, v) |= ℓ iff λ(v) = ℓ, for ℓ ∈ E.
– if α(x_1, ..., x_n) = ℓ(@a_1 = x_1, ..., @a_n = x_n), then
  (X, v) |= α(s_1, ..., s_n), where s_1, ..., s_n ∈ Dom, iff λ(v) = ℓ, and for
  each child v_i of v such that λ(v_i) = @a_i, η(v_i) = s_i for i ∈ [1..n].


Given a document X, a tree-pattern formula ψ(x), and a tuple s from Dom, ψ(s)
is satisfied in X (written X |= ψ(s)) if there is a witness node v for ψ(s).
Formally, a witness node for ψ(s) is defined as follows:
– v is a witness node for α(s), where α is an attribute formula, iff
  (X, v) |= α(s).
– v is a witness node for α(s)[ψ_1(s_1), ..., ψ_m(s_m)] iff (X, v) |= α(s) and
  there are m children v_1, ..., v_m of v such that each v_i is a witness node
  for ψ_i(s_i), for i = 1, ..., m.
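The witness-node definition translates into a short recursive matcher that, instead of testing a given tuple s, enumerates the variable bindings for which a witness exists. This is a sketch under stated assumptions: a pattern is encoded as a triple (label, attribute-to-variable map, sub-patterns), attributes are read from ElementTree's attribute dictionary, an attribute mentioned in a formula is required to be present on the node, and the names `merge`, `witnesses`, and `satisfy` are our own.

```python
import xml.etree.ElementTree as ET
from itertools import product


def merge(b1, b2):
    """Union of two bindings, or None if they disagree on a shared variable."""
    for var, value in b2.items():
        if b1.get(var, value) != value:
            return None
    return {**b1, **b2}


def witnesses(node, pattern):
    """Yield every binding under which node is a witness for pattern."""
    label, attrs, subs = pattern
    # Attribute formula: the label must match and each named attribute must exist.
    if node.tag != label or any(a not in node.attrib for a in attrs):
        return
    base = {}
    for attr, var in attrs.items():
        base = merge(base, {var: node.attrib[attr]})
        if base is None:
            return
    # Each sub-pattern must be witnessed by some child of node.
    options = [[b for child in node for b in witnesses(child, sub)]
               for sub in subs]
    for combo in product(*options):
        binding = base
        for b in combo:
            binding = merge(binding, b)
            if binding is None:
                break
        else:
            yield binding


def satisfy(root, pattern):
    """All bindings s such that some node of the document witnesses pattern(s)."""
    return [b for node in root.iter() for b in witnesses(node, pattern)]


PATTERN = ("Company", {}, [("Department", {}, [
    ("employee", {"eid": "x1"}, [
        ("manager", {"mid": "x2"}, [
            ("employee", {"eid": "x3"}, [])])])])])
```

`PATTERN` encodes the example tree-pattern formula given earlier. On a document with one Company, one Department, and an employee whose manager has one subordinate employee, `satisfy` returns a single binding for x1, x2, and x3.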

On the ontology side, we use conjunctive formulas with annotations, which
treat atomic concepts and roles as unary and binary predicates, respectively.
For example, given an ontology containing the atomic concept Employee and
roles hasId, hasManager, and manages, the following is a mapping formula:

Company[Department[
  employee(@eid = x_1)[
    manager(@mid = x_2)[
      employee(@eid = x_3)]]]] →
  Employee(Y_1), hasId(Y_1, x_1), Employee(Y_2), hasId(Y_2, x_2),
  hasManager(Y_1, x_2), Employee(Y_3), hasId(Y_3, x_3), manages(Y_2, Y_3) ::
  identif(Y_1, x_1), identif(Y_2, x_2), identif(Y_3, x_3).

There are two sorts of variables. One sort, denoted, e.g., by Y_i, represents
the individuals in the ontology; the other sort, denoted, e.g., by x_j,
represents data values, comprising the attribute values in the XML document
and the concrete values in the ontology. Since attribute values in the XML
document come from the domain Dom, while concrete values in the ontology come
from the domain ∆^I_D, we assume that each mapping formula implies a set of
conversion functions such that when a single variable name x_j is used on both
sides, the datatypes in the corresponding positions are matched through an
implicit conversion function. We denote by ConstValue the set of all data
values that occur in the XML document, and we also call them constant values.
In addition, we assume an infinite set VarValue, whose members we call
variable values, comprising an infinite set Individual of individuals and an
infinite set DataValue of data values. We require that
ConstValue ∩ VarValue = ∅.

The annotation comes after :: in the mapping formula. Each predicate in the
annotation is of the form identif(Y, Z), in which Y is an individual variable
and Z is a tuple of variables. The meaning of identif(Y, Z) is as follows. The
information in the XML document indicates that an individual belonging to the
concept C for which Y is the placeholder variable, i.e., C(Y) appears in the
formula, can be identified by a set of roles P_1, ..., P_n in the ontology,
where P_1, ..., P_n bind Y to Z in the formula, i.e.,
P_1(Y, Z_1), ..., P_n(Y, Z_n) appear. We will see later that the annotation is
important during the translation and for consistency checking in the ontology.
To specify the mapping formulas, we have proposed a semi-automatic tool,
MAPONTO, in [1].
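To illustrate how the annotation might be used, the sketch below instantiates the right-hand side of the example mapping formula for a given binding of x1, x2, x3. It adopts one possible reading of identif, namely that two individuals of the same concept with the same identifying values are the same individual. That reading, the tuple encoding of atoms, the `registry` dictionary, and the function name `instantiate` are assumptions made for illustration and should not be taken as the translation algorithm developed later in the paper.

```python
# Right-hand side of the example mapping formula, with its annotation.
RHS = [
    ("Employee", "Y1"), ("hasId", "Y1", "x1"),
    ("Employee", "Y2"), ("hasId", "Y2", "x2"),
    ("hasManager", "Y1", "x2"),
    ("Employee", "Y3"), ("hasId", "Y3", "x3"),
    ("manages", "Y2", "Y3"),
]
IDENTIF = {"Y1": ("x1",), "Y2": ("x2",), "Y3": ("x3",)}


def instantiate(rhs, identif, binding, registry):
    """Ground the atoms of rhs, reusing individuals with equal identifying values."""
    # Concept of each individual variable, taken from its unary atom C(Y).
    concept_of = {atom[1]: atom[0] for atom in rhs if len(atom) == 2}

    def value(term):
        if term in identif:
            # Individual variable: look up or create the individual for its key.
            key = (concept_of[term], tuple(binding[x] for x in identif[term]))
            return registry.setdefault(key, f"t{len(registry) + 1}")
        return binding[term]      # data variable: use the bound value

    return {(atom[0], *map(value, atom[1:])) for atom in rhs}
```

Passing the same `registry` to successive calls lets an employee that appears under several bindings, for example once as a subordinate and once as a top-level employee, map to a single individual.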

3 The Problem of Translating XML data into Ontologies

We now define the problem of translating XML into ontologies (X-to-O).