Building Ontologies from XML Data Sources.
University of Burgundy
Abstract—In this paper, we present a tool called X2OWL
that aims at building an OWL ontology from an XML data
source. The method relies on the XML schema to automatically
generate the ontology structure as well as a set of mapping
bridges. It also includes a refinement step that cleans the
mapping bridges and, if needed, restructures the generated
ontology.
Integrating information from heterogeneous information
sources is a critical issue. To achieve an efficient
integration, we have to resolve the syntactic, structural and
semantic heterogeneities of the information sources.
Ontologies are a promising technology for solving the semantic
heterogeneity problem, because they allow the common semantics
of a domain of discourse to be represented explicitly. An
ontology formally defines the concepts of a domain and the
relationships between these concepts.
In ontology-based approaches for information integration,
local ontologies are used to describe the semantics of local
information sources. The advantage of wrapping each
information source in a local ontology is that each source
ontology can be developed independently of the other sources
and ontologies. Hence, the integration task is simplified and
the addition and removal of sources can be handled easily.
Information sources can be structured such as relational
databases, or semi-structured such as XML data sources.
However, each information source should be mapped to its
own local ontology. In this paper we focus on mapping XML
data sources to ontologies.
We present a tool, called X2OWL, whose main function is
to automatically create an OWL ontology from an XML data
source. In this approach, data instances are not materialized
in the local ontology. That is, the generated ontology
contains only the concepts and properties, not the instances,
which stay in the source and are retrieved and translated as
needed in response to user queries.
The X2OWL tool also generates a mapping document that
describes the correspondences between the entities of the
XML source and the resulting local ontology. This mapping
document is used for query processing. Our approach also
includes a refinement step that cleans the mapping bridges
and, if needed, restructures the generated ontology.
II. RELATED WORKS
Existing approaches that relate XML data to ontologies can be
classified into two main categories: 1) approaches that
create an ontology from an XML document, and 2) approaches
that map an XML document to an existing ontology. In this
paper, we are interested in the first category.
Ferdinand et al. propose an approach to build an
OWL ontology from an XML schema and to transform
XML documents to RDF graphs. The XML schema to OWL
mapping process is based on pre-defined mapping rules.
OWL classes emerge from XML schema complex types,
model group definitions and attribute group definitions.
OWL object properties emerge from elements of complex
type. OWL datatype properties emerge from elements of
simple type and from attributes. Finally, class inheritance
emerges from XML schema inheritance by restriction and
inheritance by extension.
Bohring et al. propose a similar approach to create
OWL ontologies from XML schemas. In this approach,
OWL classes emerge from named XSD complex types and
from XSD elements that contain other elements or have at
least one attribute. When an element contains another
element, an OWL object property is created between their
corresponding OWL classes. OWL datatype properties emerge
from XML attributes and from elements that contain only a
literal and no attributes.
Both works provide a good basis of rules for creating OWL
ontologies from XML. However, they address only simple cases
and do not consider the complex cases that arise from the
reuse of global types and elements. They also do not specify
how mappings between the XML source and the generated OWL
ontology are expressed.
Cruz et al. propose an approach to integrate heterogeneous
XML sources using an ontology-based mediation
architecture. The ontology integration process contains two
steps: schema transformation and ontology merging. In the
first step, RDFS is used to model each XML source as
a local RDF ontology, which gives a uniform representation
basis for the ontology merging step. The transformation from
XML to RDF is done as follows: complex-type elements
are transformed to rdfs:Class, attributes and simple-type
elements are transformed to rdfs:Property, and the
element-subelement relationship is encoded as a
class-to-class relationship using a newly defined RDFS
predicate “rdfx:contain”. The resulting ontology is
semantically rather poor, both because it is based on RDF
and because of the way the element-subelement relationship
is represented (using “rdfx:contain”).
Xu and Li propose an approach to construct an OWL
ontology from an XML document with the help of the
entity-relation model. They propose an XML-to-Relational
(XTR) mapping approach to map an XML document to
an entity-relation model, and then a Relational-to-Ontology
(RTO) mapping approach to map the entity-relation model to
an OWL ontology. However, the OWL ontology is expressed
using ad-hoc vocabularies for describing relational databases,
and therefore cannot be considered a domain ontology.
We propose an extended approach to create an OWL
ontology from an XML data source. This approach takes into
account complex cases arising from different XML schema
design styles. Our approach also provides a set of mapping
bridges between the entities of the XML source and the
generated ontology.
III. OUR APPROACH
To obtain an efficient and complete method for
building OWL ontologies from XML data sources, several
aspects have to be taken into account:
1) The method should be based on XML schemas instead
of documents, because an XML schema can be used
by multiple documents. This will avoid generating
multiple ontologies for multiple documents conform-
ing to the same schema.
2) The method should provide mapping bridges
that specify the correspondences between XML entities
and OWL terms. Such mapping bridges contribute
to query translation between OWL and XML.
3) The method should rely on XML schema type
declarations (instead of element declarations) in order
to benefit from the reuse of types by several
elements within the schema. Relying on element
declarations would generate redundant OWL terms
from multiple elements of the same type.
4) XML schemas can be modeled using different styles.
Some of them use a single global element (root
element), others use multiple global elements. Some
styles use global types, others use only local types.
The mapping method should cope with all
possible design patterns.
5) The method should include a finalization step that
refines the generated ontology and mapping bridges. The
purpose of this refinement is to adjust the structure of
the generated ontology and to remove useless mapping
bridges.
Our proposed method to build OWL ontologies from
XML data sources fulfills all these requirements. It relies
on the XML schema to build the ontology. If the
schema does not exist, it can be automatically generated
from the source XML document. Apart from this optional
step, the proposed method comprises two processes: 1)
automatic generation of the OWL ontology from the XML schema,
and 2) manual refinement of the generated ontology and the
mapping bridges.
We adopt the notation of prior work to specify
XML-to-OWL mappings. Three types of mappings are
distinguished:
• Class mapping: maps an XML node to an OWL class.
• Datatype property mapping: maps an XML node to an
OWL datatype property.
• Object property mapping: relates two class mappings
to an OWL object property.
In these mappings, OWL resources (classes, object and
datatype properties) are addressed using their URI
references, and mapped XML nodes are addressed using XPath
expressions. To allow our method to cope with
all possible design patterns of XML schemas, we define
our mapping rules and algorithm in a pattern-independent
way.
A. Mapping Rules
Our proposed method is based on the XML schema, that is,
the entities of the schema are transformed to OWL entities.
Basically, OWL classes emerge from complex types,
element-group declarations, and attribute-group declarations.
Object properties emerge from element-subelement relationships.
Datatype properties emerge from attributes and from
simple-type elements.
OWL classes: We can distinguish two kinds of complex
types: 1) global, named complex types, and 2) local,
anonymous complex types. Both kinds are mapped to OWL
classes. However, a class generated from a global named
type takes the name of that type, while a class generated
from a local anonymous type takes the name of the
(only) surrounding element. Element-group and
attribute-group declarations are also mapped to OWL classes.
XML schema supports two mechanisms of inheritance:
extension and restriction. Both of these inheritance mech-
anisms are translated to the class inheritance mechanism
of OWL (using rdfs:subClassOf). When a complex type
is defined as an extension or a restriction of another base
complex type, then the class corresponding to this type is
set as subclass of the class corresponding to the base type.
Object properties: Elements (global or local) are not
mapped directly to the ontology, but the element-subelement
relationship in the schema is translated into an object
property in the ontology. That is, when an element has a
complex type, that complex type is already mapped to an
OWL class. Therefore, an object property is added to the
ontology, having as domain the class corresponding to the
surrounding complex type, and as range the OWL
class corresponding to the type of the element. The name of
this object property is the concatenation of “has” with the
name of the range class.
Datatype properties: If an element has a simple type,
then it is mapped to a datatype property having as domain
the OWL class corresponding to the surrounding complex
type, and as range its XSD datatype. Attributes are
treated as simple elements and are mapped to datatype
properties. If a complex type is mixed, then the elements
that have this type contain text as well as subelements
and/or attributes. To take this text into account, a datatype
property is added to the ontology, having as domain the class
corresponding to the surrounding complex type, and
as range the “xsd:string” datatype.
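The naming rules above can be illustrated with a minimal Python sketch. It is not the X2OWL implementation (which is written in Java); the `ComplexType` record, the function names and the type name `ItemType` are hypothetical, and upper-casing the first letter after “has” is an assumption inferred from property names such as hasItem in the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ComplexType:
    # Hypothetical record: name is None for a local anonymous type.
    name: Optional[str]


def class_name(ctype: ComplexType, surrounding_element: str) -> str:
    """Global named type -> type name; local anonymous type -> element name."""
    return ctype.name if ctype.name is not None else surrounding_element


def object_property_name(range_class: str) -> str:
    """Concatenate "has" with the range class name.

    Upper-casing the first letter is an assumption based on names such as
    hasItem; the rule itself only states the concatenation.
    """
    return "has" + range_class[:1].upper() + range_class[1:]


def mixed_text_property(owner_class: str) -> dict:
    """Datatype property carrying the text of a mixed complex type."""
    return {"domain": owner_class, "range": "xsd:string"}


print(class_name(ComplexType("ItemType"), "item"))  # global type: ItemType
print(class_name(ComplexType(None), "item"))        # local type: item
print(object_property_name("item"))                 # hasItem
print(mixed_text_property("item"))
```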
B. Mapping Algorithm
In this section, we present the algorithm that applies our
mapping rules to the XML schema in order to generate
the OWL ontology entities and the corresponding mapping
bridges. To ensure independence from the schema design style,
our algorithm is based on an XML Schema Graph (XSG) that
describes the schema in the same way whatever its design
style.
An XML Schema Graph G = (V,E) is generated from
the XML schema, where V is the vertex set, and E is the
edge set. The set V contains all elements, attributes, non-
primitive types, element groups and attribute groups. The
set E contains the edges established:
• from each element to its type (if not primitive),
• from each type, element group or attribute group to
their contained elements and/or attributes.
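As an illustration of this definition, the following Python sketch builds an XSG as adjacency lists. The two input dictionaries are a hypothetical, hand-written description of a shiporder schema in which all complex types are local (each type vertex is named after its owner element); identifying a vertex by its kind and name is a simplification made for the sketch.

```python
from collections import defaultdict

# Hypothetical description of the schema: element -> its non-primitive type.
ELEMENT_TYPES = {
    "shiporder": "shiporder", "items": "items", "ships": "ships",
    "ship": "ship", "item": "item",
}
# Hypothetical description of the schema: type -> contained members.
TYPE_MEMBERS = {
    "shiporder": [("attribute", "orderid"), ("element", "orderperson"),
                  ("element", "items"), ("element", "ships")],
    "items": [("element", "item")],
    "ships": [("element", "ship")],
    "ship": [("element", "date"), ("element", "item")],
    "item": [("attribute", "title"), ("element", "title"),
             ("element", "quantity"), ("element", "price")],
}


def build_xsg(element_types, type_members):
    """Return the XSG as {vertex: [successor vertices]}."""
    graph = defaultdict(list)
    # Edge from each element to its (non-primitive) type.
    for element, ctype in element_types.items():
        graph[("element", element)].append(("type", ctype))
    # Edges from each type to its contained elements and attributes.
    for ctype, members in type_members.items():
        for member in members:
            graph[("type", ctype)].append(member)
            graph.setdefault(member, [])  # leaf vertices also belong to V
    return dict(graph)


def in_degrees(graph):
    """Count incoming edges per vertex."""
    degrees = {vertex: 0 for vertex in graph}
    for successors in graph.values():
        for vertex in successors:
            degrees[vertex] += 1
    return degrees


xsg = build_xsg(ELEMENT_TYPES, TYPE_MEMBERS)
degrees = in_degrees(xsg)
roots = [vertex for vertex, degree in degrees.items() if degree == 0]
print(roots)
print(degrees[("element", "item")])
```

In this sketch the element item has two incoming edges because it is reused by two types, so the graph is not a tree.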
An XSG is a directed acyclic graph (DAG) that always has
a unique root vertex, which is the vertex of the root
element of the XML document. An XSG becomes a tree when
element and type declarations are not reused within the
schema. Our method to generate the OWL ontology is based
on this XSG. Starting from the root vertex, the XSG is
visited depth-first. For every visit of an element or attribute
vertex, an XPath expression is computed. Since each vertex
can be visited more than once, it can have several XPath
expressions.
When we visit a vertex $v_{el}$ of an element $el$, we carry
information about its (current) parent XPath $xpath_{parent}$,
parent OWL class $C_{parent}$ and parent class mapping
$CM_{parent}$.
First, an XPath expression for $el$ is computed as
$xpath_{el} = xpath_{parent} + \text{"/"} + el$.
If $el$ has a complex type $CT_{el}$, then we create an OWL
class $C_{el}$ (if not already created during a previous visit).
The name of this class is the name of the type $CT_{el}$ if it is
global. If the type $CT_{el}$ is local, the class $C_{el}$ takes
the name of the element $el$ itself.
• A class mapping $CM_{el}$ is created as
$CM_{el} = (C_{el}, xpath_{el})$.
• An object property $OP_{el}$ is created from $C_{parent}$ to
$C_{el}$ (if not already created during a previous visit), named
after $C_{el}$ with the prefix “has”.
• An object property mapping $OPM_{el}$ is created as
$OPM_{el} = (OP_{el}, CM_{parent}, CM_{el})$.
If $el$ has a simple type $ST_{el}$, then we create a datatype
property $DTP_{el}$. Its domain is $C_{parent}$ and its range is
the XSD datatype of $ST_{el}$ (if primitive, or xsd:anyType
otherwise). A datatype property mapping $DPM_{el}$ is created
as $DPM_{el} = (DTP_{el}, CM_{parent}, xpath_{el} + \text{"/text()"})$.
After the treatment of the element $el$, its children vertices
are visited, and we carry $xpath_{el}$, $C_{el}$, and $CM_{el}$ as
parent information for treating those children.
When an attribute vertex is visited, it is treated as an
element vertex of a simple type. When visiting type vertices,
no treatment is performed, because types are handled when
visiting their owner elements/attributes.
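The traversal described in this section can be sketched in Python as follows. This is a simplified illustration, not the X2OWL code: the `SCHEMA` dictionary is a hypothetical in-memory form of a shiporder schema with only local complex types, the type of the orderid attribute is assumed to be xsd:string, attribute paths are written with "/@" as in the mapping bridges of the example, and class mapping identifiers (cm1, cm2, ...) are assigned in visiting order.

```python
from dataclasses import dataclass, field

# Hypothetical in-memory schema: each element is either simple or complex.
SCHEMA = {
    "shiporder": {"attributes": {"orderid": "xsd:string"},  # type assumed
                  "children": ["orderperson", "items", "ships"]},
    "orderperson": {"simple": "xsd:string"},
    "items": {"children": ["item"]},
    "ships": {"children": ["ship"]},
    "ship": {"children": ["date", "item"]},
    "date": {"simple": "xsd:NMTOKEN"},
    "item": {"attributes": {"title": "xsd:string"},
             "children": ["title", "quantity", "price"]},
    "title": {"simple": "xsd:string"},
    "quantity": {"simple": "xsd:integer"},
    "price": {"simple": "xsd:decimal"},
}


@dataclass
class Result:
    classes: set = field(default_factory=set)
    object_properties: set = field(default_factory=set)
    datatype_properties: set = field(default_factory=set)
    class_mappings: dict = field(default_factory=dict)    # id -> (class, xpath)
    datatype_mappings: list = field(default_factory=list)  # (property, cm, xpath)
    object_mappings: list = field(default_factory=list)    # (property, cm, cm)


def visit(el, xpath_parent, c_parent, cm_parent, out):
    """Depth-first visit of element `el`, carrying the parent information."""
    xpath_el = xpath_parent + "/" + el
    decl = SCHEMA[el]
    if "simple" in decl:
        # Simple type: datatype property on the parent class.
        out.datatype_properties.add((el, c_parent, decl["simple"]))
        out.datatype_mappings.append((el, cm_parent, xpath_el + "/text()"))
        return
    # Complex (here always local) type: the class is named after the element.
    c_el = el
    out.classes.add(c_el)  # a set, so a repeated visit creates nothing new
    cm_el = f"cm{len(out.class_mappings) + 1}"
    out.class_mappings[cm_el] = (c_el, xpath_el)
    if c_parent is not None:
        op_el = "has" + c_el[:1].upper() + c_el[1:]
        out.object_properties.add(op_el)
        out.object_mappings.append((op_el, cm_parent, cm_el))
    # Attributes are treated like simple-type elements of the new class.
    for attr, xsd_type in decl.get("attributes", {}).items():
        out.datatype_properties.add((attr, c_el, xsd_type))
        out.datatype_mappings.append((attr, cm_el, xpath_el + "/@" + attr))
    for child in decl.get("children", []):
        visit(child, xpath_el, c_el, cm_el, out)


def generate(root):
    out = Result()
    visit(root, "", None, None, out)
    return out


result = generate("shiporder")
for cm_id, mapping in result.class_mappings.items():
    print(cm_id, "=", mapping)
```

Because item is reachable through both items and ship, it is visited twice and receives two class mappings, while the class item and the property hasItem are created only once.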
C. Detailed Example
Figure 1 shows an XML document describing a shipment
order that is composed of an order person, a list of items
and a list of shipments. Each item contains a title, quantity,
and price, while each shipment contains a date and a list of
items shipped. We can note that items are mentioned in a
shipment as an element with an attribute title, whereas
they are mentioned in the list of items as element with three
sub-elements: title, quantity and price. However, in
the XML schema (Figure 2) only one global element item
is mentioned, and it is referenced by the list of items (the
element items) and the element ship. In addition, the
element item in the schema contains a subelement as well
as an attribute both named title.
This XML example may appear poorly designed, since the
same information (the title of an item) is represented twice,
as a sub-element and as an attribute. We deliberately
chose this example to demonstrate that our method works
even when such cases occur.
Figure 3 shows the OWL ontology generated from the
XML schema. We can see that one OWL class is created
for each complex type in the schema, an object property
is created between every two classes corresponding to two
nested elements/types, and a datatype property is created for
each attribute and each element of simple type.
<title>Hide your heart</title>
<title>Hearts of Fire</title>
<item title="Empire Burlesque" />
<item title="Hide your heart" />
Figure 1. XML document example
Since the element item is referenced by two other
elements, the object property hasItem should have
two domains, Items and Ship. Multiple domain axioms are
allowed in OWL, but they are interpreted as a conjunction.
Therefore, to state that the domain of the property
hasItem can be either Ship or Items, we set
the union of these classes as the domain of hasItem. In
addition, we can see that the class Item has a single datatype
property title, although this property corresponds to two
different entities in the schema: the title element and the title
attribute of the element item.
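To make the union domain concrete, the following Python sketch emits the corresponding axiom in Turtle syntax. The function name and the default `:` prefix are hypothetical, and the snippet assumes that the owl, rdfs and default prefixes are declared elsewhere in the ontology file.

```python
def union_domain_axiom(prop, classes, prefix=":"):
    """Return a Turtle snippet setting the domain of `prop` to a class union."""
    members = " ".join(prefix + name for name in classes)
    return (
        f"{prefix}{prop} a owl:ObjectProperty ;\n"
        f"    rdfs:domain [ a owl:Class ; owl:unionOf ( {members} ) ] ."
    )


print(union_domain_axiom("hasItem", ["Items", "Ship"]))
```

Writing two separate rdfs:domain statements instead would state that every subject of hasItem is both an Items and a Ship, which is the conjunction the text warns about.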
Figure 4 shows the mapping bridges established during
the ontology generation process. We can note that an OWL
term can be related to many XML terms. For example, the
class item has two class mappings, the first one relates it to
/shiporder/items/item, and the second one relates
it to /shiporder/ships/ship/item.
We can also note that the automatic nature of mapping
generation produces some invalid mapping bridges.
Such invalid mappings arise because different types and
elements in the schema reference or share the same type
and/or element. Thus, some XPath expressions that are
automatically derived from the schema are not actually valid
in the original XML document. For example, one of the
mapping bridges of the datatype property quantity contains
the XPath expression
/shiporder/ships/ship/item/quantity/text(), which
is generated because the element item is referenced by
both items and ship in the schema. This expression is
not valid because the element ship/item in the document
does not contain the element quantity.
The first purpose of the refinement step is to detect and
remove invalid mapping bridges. Invalid mappings have to
be removed because, if they are used in query resolution,
they lead to invalid queries that return no results.
<xs:element name="orderperson" type="xs:string"/>
<xs:attribute name="orderid" use="required"
<xs:element maxOccurs="unbounded" ref="item"/>
<xs:element name="date" type="xs:NMTOKEN"/>
<xs:element maxOccurs="unbounded" ref="item"/>
<xs:element name="title" type="xs:string"/>
<xs:element name="quantity" type="xs:integer"/>
<xs:element name="price" type="xs:decimal"/>
<xs:attribute name="title" type="xs:string"/>
Figure 2. XML schema example
Figure 3. Generated OWL ontology
We say that a mapping bridge is invalid if it contains an
invalid XPath expression (for a given XML document) or
if it references another invalid mapping bridge. In Figure 4,
struck-out mapping bridges are invalid because they contain
XPaths that are invalid with respect to the original XML
document of Figure 1.
Class Mappings
cm1 = (shiporder, /shiporder)
cm2 = (items, /shiporder/items)
cm3 = (item, /shiporder/items/item)
cm4 = (ships, /shiporder/ships)
cm5 = (ship, /shiporder/ships/ship)
cm6 = (item, /shiporder/ships/ship/item)
Datatype Property Mappings
dm1 = (orderid, cm1, /shiporder/@orderid)
dm2 = (title, cm3, /shiporder/items/item/@title)
dm3 = (quantity, cm3, /shiporder/items/item/quantity/text())
dm4 = (title, cm3, /shiporder/items/item/title/text())
dm5 = (price, cm3, /shiporder/items/item/price/text())
dm6 = (orderperson, cm1, /shiporder/orderperson/text())
dm7 = (title, cm6, /shiporder/ships/ship/item/@title)
dm8 = (quantity, cm6, /shiporder/ships/ship/item/quantity/text())
dm9 = (title, cm6, /shiporder/ships/ship/item/title/text())
dm10 = (price, cm6, /shiporder/ships/ship/item/price/text())
dm11 = (date, cm5, /shiporder/ships/ship/date/text())
Object Property Mappings
om1 = (hasItems, cm1, cm2)
om2 = (hasItem, cm2, cm3)
om3 = (hasShips, cm1, cm4)
om4 = (hasShip, cm4, cm5)
om5 = (hasItem, cm5, cm6)
Figure 4. Mapping Bridges
Invalid mappings can be detected automatically if
an XML document is provided that is considered
representative of all XML documents conforming to the
XML schema in use. In this case, all XPath expressions
occurring in this document are extracted. Then, the XPath
expressions of the mapping bridges are compared with those
extracted from the representative XML document. Any mapping
bridge that contains an XPath expression not occurring in the
representative XML document is considered invalid. Furthermore,
the mapping bridges are rescanned to detect bridges
that reference invalid mapping bridges. Those bridges are
also considered invalid. The final result of this process is a
clean mapping document that contains only valid bridges. If
no representative XML document is provided, the process can
be carried out manually by a human expert.
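The automatic detection can be sketched in Python with the standard xml.etree.ElementTree module. The sample document below is hypothetical: it is written to match the description of Figure 1, and its attribute, person, quantity, price and date values are made up for illustration. The bridges are those of Figure 4, and all function names are assumptions of this sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical representative document in the spirit of Figure 1.
SAMPLE = """<shiporder orderid="o1">
  <orderperson>John Smith</orderperson>
  <items>
    <item><title>Empire Burlesque</title><quantity>1</quantity><price>10.90</price></item>
    <item><title>Hide your heart</title><quantity>1</quantity><price>9.90</price></item>
  </items>
  <ships>
    <ship><date>2009-01-01</date><item title="Empire Burlesque"/></ship>
  </ships>
</shiporder>"""

I, S = "/shiporder/items/item", "/shiporder/ships/ship"
CLASS_MAPS = {
    "cm1": ("shiporder", "/shiporder"), "cm2": ("items", "/shiporder/items"),
    "cm3": ("item", I), "cm4": ("ships", "/shiporder/ships"),
    "cm5": ("ship", S), "cm6": ("item", S + "/item"),
}
DATATYPE_MAPS = {
    "dm1": ("orderid", "cm1", "/shiporder/@orderid"),
    "dm2": ("title", "cm3", I + "/@title"),
    "dm3": ("quantity", "cm3", I + "/quantity/text()"),
    "dm4": ("title", "cm3", I + "/title/text()"),
    "dm5": ("price", "cm3", I + "/price/text()"),
    "dm6": ("orderperson", "cm1", "/shiporder/orderperson/text()"),
    "dm7": ("title", "cm6", S + "/item/@title"),
    "dm8": ("quantity", "cm6", S + "/item/quantity/text()"),
    "dm9": ("title", "cm6", S + "/item/title/text()"),
    "dm10": ("price", "cm6", S + "/item/price/text()"),
    "dm11": ("date", "cm5", S + "/date/text()"),
}
OBJECT_MAPS = {
    "om1": ("hasItems", "cm1", "cm2"), "om2": ("hasItem", "cm2", "cm3"),
    "om3": ("hasShips", "cm1", "cm4"), "om4": ("hasShip", "cm4", "cm5"),
    "om5": ("hasItem", "cm5", "cm6"),
}


def document_xpaths(root):
    """Collect every element, attribute and text() path in the document."""
    paths = set()

    def walk(node, path):
        paths.add(path)
        for attr in node.attrib:
            paths.add(f"{path}/@{attr}")
        if node.text and node.text.strip():
            paths.add(f"{path}/text()")
        for child in node:
            walk(child, f"{path}/{child.tag}")

    walk(root, "/" + root.tag)
    return paths


def invalid_bridges(class_maps, datatype_maps, object_maps, valid_paths):
    """Bridges with an unknown XPath, plus bridges that reference them."""
    invalid = {cid for cid, (_, xp) in class_maps.items()
               if xp not in valid_paths}
    invalid |= {did for did, (_, cm, xp) in datatype_maps.items()
                if xp not in valid_paths or cm in invalid}
    invalid |= {oid for oid, (_, dom, rng) in object_maps.items()
                if dom in invalid or rng in invalid}
    return invalid


valid_paths = document_xpaths(ET.fromstring(SAMPLE))
invalid = invalid_bridges(CLASS_MAPS, DATATYPE_MAPS, OBJECT_MAPS, valid_paths)
print(sorted(invalid))
```

Since property mappings only reference class mappings, checking class mappings first makes a single pass sufficient in this sketch. With the sample above, every class mapping path occurs in the document, so only datatype property mappings are reported.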
The refinement step also includes an optional process of
restructuring the generated ontology. A human expert may not
be satisfied with the structure of the automatically generated
ontology. Our approach allows the expert to modify the
ontology structure manually, for example by renaming or
removing ontology terms, or by changing the domain and the
range of a property. However, modifying the ontology structure
requires corresponding modifications of the mapping bridges
in order to keep them consistent with the ontology.
In our example, the expert may decide to remove the class
ships and then to relate the class shiporder directly
to the class ship (via the object property hasShip).
This change requires the removal of the class mapping
cm4 = (ships,/shiporder/ships) and the object
property mapping om3 = (hasShips,cm1,cm4), and
the modification of the object property mapping om4 to be:
om4 = (hasShip,cm1,cm5).
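The bookkeeping for this change can be sketched in Python as follows. The function name and the dictionary layout are hypothetical, and the sketch assumes that the removed class mapping has exactly one incoming object property mapping and no datatype property mappings attached to it, as is the case for cm4 in the example.

```python
def remove_intermediate_class(cm_id, class_mappings, object_mappings):
    """Drop class mapping `cm_id` and reattach its children to its parent.

    Assumes exactly one object property mapping has `cm_id` as its range.
    """
    incoming = [key for key, (_, _, rng) in object_mappings.items()
                if rng == cm_id]
    parent = object_mappings[incoming[0]][1]
    new_class_mappings = {key: value for key, value in class_mappings.items()
                          if key != cm_id}
    new_object_mappings = {}
    for key, (prop, dom, rng) in object_mappings.items():
        if rng == cm_id:
            continue  # the property leading to the removed class disappears
        new_object_mappings[key] = (prop, parent if dom == cm_id else dom, rng)
    return new_class_mappings, new_object_mappings


class_mappings = {
    "cm1": ("shiporder", "/shiporder"),
    "cm4": ("ships", "/shiporder/ships"),
    "cm5": ("ship", "/shiporder/ships/ship"),
}
object_mappings = {
    "om3": ("hasShips", "cm1", "cm4"),
    "om4": ("hasShip", "cm4", "cm5"),
}
new_cms, new_oms = remove_intermediate_class("cm4", class_mappings, object_mappings)
print(new_cms)
print(new_oms)
```

The XPath of cm5 is left unchanged, because only the ontology is restructured and the XML documents still contain the ships element.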
In this paper, we have presented a method to generate
an OWL ontology from an XML data source. This method
relies on the XML schema to automatically generate the
ontology structure as well as a set of mapping bridges.
The presented method also includes a refinement step that
cleans the mapping bridges and restructures the
generated ontology.
We have developed a tool, called X2OWL, as an
implementation of the proposed method (Figure 5). This tool is
written in Java and uses several APIs available online, such
as Jena for building OWL ontologies, Trang for generating
XML schemas from XML documents, XSOM for analyzing
XML schemas, and JUNG for graph-based manipulations.
Figure 5. X2OWL Prototype
Bohring, H. and Auer, S.: Mapping XML to OWL Ontologies.
In Leipziger Informatik-Tage, vol. 72, 147–156, (2005).
Cruz, I. F., Xiao, H., and Hsu, F.: An Ontology-based
Framework for XML Semantic Integration. In IDEAS ’04:
Proceedings of the International Database Engineering and
Applications Symposium, 217–226, (2004).
Ferdinand, M., Zirpins, C., and Trastour, D.: Lifting XML
Schema to OWL. In Web Engineering - 4th International
Conference, ICWE 2004, Munich, Germany, 354–358, (2004).
McGuinness, D. L., and van Harmelen, F.: OWL Web Ontology
Language Overview. W3C recommendation, W3C, (2004).
Rodrigues, T., Rosa, P., and Cardoso, J.: Mapping XML
to Existing OWL ontologies. In International Conference
WWW/Internet 2006, 72–77, (2006).
Xu, J. and Li, W.: Using Relational Database to Build OWL
Ontology from XML Data Sources. In CISW ’07: Proceedings
of the 2007 International Conference on Computational
Intelligence and Security Workshops, 124–127, IEEE Computer
Society, Washington, DC, USA, (2007).