NAUTILOD: A Formal Language for the Web of Data Graph
VALERIA FIONDA, Department of Mathematics and Computer Science, University of Calabria
GIUSEPPE PIRRÒ, WeST, University of Koblenz
CLAUDIO GUTIERREZ, DCC, University of Chile
The Web of Linked Data is a huge graph of distributed and interlinked datasources fueled by structured information. This new environment calls for formal languages and tools to automate navigation across datasources (nodes in such a graph) and enable semantic-aware and Web-scale search mechanisms. In this paper we introduce a declarative navigational language for the Web of Linked Data graph called NAUTILOD. NAUTILOD makes it possible to specify datasources via the intertwining of navigation and querying capabilities. It also features a mechanism to specify actions (e.g., send notification messages) that obtain their parameters from datasources reached during the navigation. We provide a formalization of the NAUTILOD semantics, which captures both nodes and fragments of the Web of Linked Data. We present algorithms to implement such semantics and study their computational complexity. We discuss an implementation of the features of NAUTILOD in a tool called swget, which exploits current Web technologies and protocols. We report on the evaluation of swget and its comparison with related work. Finally, we show the usefulness of capturing Web fragments by providing examples in different knowledge domains.
Categories and Subject Descriptors: H.2.3 [DATABASE MANAGEMENT]: Languages; H.2.4 [DATABASE
MANAGEMENT]: Systems
General Terms: Design, Algorithms
Additional Key Words and Phrases: Navigation, Graph languages, Web of Data, Linked Data, Semantic Web
ACM Reference Format:
Fionda, V., Pirrò, G., Gutierrez, C., 2014. NAUTILOD: A Formal Language for the Web of Data Graph. ACM
Trans. Web 0, 0, Article 0 ( 0), 43 pages.
DOI:http://dx.doi.org/10.1145/0000000.0000000
Authors’ addresses: V. Fionda, Department of Mathematics and Computer Science, University of Calabria, email:fionda@mat.unical.it (contact author); G. Pirrò, WeST, University of Koblenz-Landau, email:pirro@uni-koblenz.de (contact author); C. Gutierrez, Computer Science Department, University of Chile, email:cgutierr@dcc.uchile.cl.
A preliminary version of this paper titled “Semantic navigation on the Web of data: specification of routes,
Web fragments and actions” appeared in Proceedings of the 21st World Wide Web Conference, pp. 281-290,
ACM Press, 2012.
V. Fionda’s work was partially supported by the European Commission, the European Social Fund and the Calabria region. G. Pirrò was partially supported by the EU Framework Programme for Research and Innovation under grant agreement no. 611242 (Sense4us). C. Gutierrez was supported by projects FONDECYT No. 1110287 and Millennium Nucleus CIWS, NC120004.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or permissions@acm.org.
© 0 ACM 1559-1131/0/-ART0 $15.00
ACM Transactions on the Web, Vol. 0, No. 0, Article 0, Publication date: 0.
1. INTRODUCTION
There is an increasing availability of structured data on the Web. A vast portion of the Web can now be thought of as a large graph where nodes represent datasources describing entities (e.g., people, places) and edges represent semantic relations (e.g., born in, located in) between them. Such an enhanced version of the Web is usually referred to as the Web of Linked Data (WLOD). Different projects contribute to the spread of the WLOD. On the one hand, Linking Open Data¹ [Heath and Bizer 2011] aims at setting some principles for the publishing and interlinking of data on the Web by using the Resource Description Framework (RDF) [Klyne et al. 2004]. One key aspect of this project is the decentralized nature of data. On the other hand, applications like the Google Knowledge Graph (KG)² and the Facebook Graph (FG)³ give access to structured descriptions of entities from centralized gateways.
Although different in their goals, all these approaches share a common technical basis: large graphs that model structured information and emphasize the functionalities of discovery and exploration, which are achieved by (manual) navigation from the entity (node in the graph) currently being visited toward semantically related entities, via labeled edges. The querying capabilities in these approaches do not go beyond asking “tell me friends living in my same city” or matching keywords to specific entities. It is not possible to perform simple information requests involving, for instance, the checking of paths, although questions like “find people up to three hops away from me that like music and live in my same city” sound very natural when dealing with graphs. Little flexibility is given to Web developers and users in terms of languages and tools capable of harnessing and putting together the power of the huge amount of structured data available in the WLOD [Weaver and Tarjan 2013].
In this paper we introduce a formal graph navigational language for the WLOD called NAUTILOD. It makes it possible to write navigational expressions reflecting conceptual specifications of both nodes and fragments (subgraphs) of interest in the WLOD. These formal specifications are expressed in a declarative way and enable navigation across datasources in the WLOD without human intervention. NAUTILOD also introduces the functionality of specifying actions. Actions make it possible to send alerts and mails, and open a whole new world of functionalities that contribute to the automation of information search on the WLOD. The technological underpinnings of our proposal are open and standard technologies such as the RDF data format and the SPARQL query language for RDF [Harris and Seaborne 2013]. In the remainder of this introductory section, we present more formally the WLOD; the notion of navigation on the Web, which takes advantage of the semantics of data on the Web; an overview of the general design of NAUTILOD; and the structure of the paper.
1.1. From the Web of documents to the Web of Linked Data
The Web is usually modeled as a graph of links between pages [Brin and Page 1998]. This model has some peculiarities. First, links between pages are unlabeled and can only be expressed at one of the two endpoints⁴. For example, in Wikipedia a link from the HTML page about Rome (i.e., wiki:Rome⁵) to the page about Italy (i.e., wiki:Italy) can only be set in the page wiki:Rome. In the WLOD, links can be set in and be part of any datasource, in the spirit of Tim Berners-Lee’s words “Anyone can say anything about any topic and publish it anywhere” [Berners-Lee 1998]. For instance, a link between the DBpedia resources Rome (i.e., dbpedia:Rome) and Italy (i.e., dbpedia:Italy) can be part of any RDF datasource. Indeed, an arbitrary datasource d identified by the URI u can contain the triple (dbpedia:Rome, dbpo:country, dbpedia:Italy). This triple will then be available when dereferencing u. Note that, differently from the case of HTML pages, the link now includes a label that specifies one of the possible semantic relations between dbpedia:Rome and dbpedia:Italy. Labels carry a semantic meaning because they are part of conceptual specifications (e.g., ontologies, thesauri) used to model knowledge domains.
¹ http://www.w3.org/wiki/SweoIG/TaskForces/CommunityProjects/LinkingOpenData
² http://www.google.com/insidesearch/features/search/knowledge.html
³ http://www.facebook.com/about/graphsearch
⁴ XLink (http://www.w3.org/TR/xlink11/) was proposed to overcome this limitation.
⁵ The prefixes used in the paper are those available at http://prefix.cc
Second, although Web pages are created and kept distributively, their small size and lack of structure stimulated the idea of viewing searching and querying through single, centralized repositories. These repositories are built via crawlers that, starting from a set of seed pages, navigate the Web graph to reach Web pages whose content will be processed and stored. The WLOD can be seen as a semantic graph where nodes are autonomous datasources identified by URIs and maintaining sets of RDF triples, some of which provide links toward other datasources. This model makes it possible to better express the distributed creation and maintenance of data, and the fact that the structure of the WLOD is provided by dynamic and distributed datasources. In particular, it reflects the fact that at each moment in time, and for each particular agent, the whole graph of data on the Web is unknown [Mendelzon et al. 1997].
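The view of the WLOD as a graph whose nodes are datasources can be illustrated with a small sketch. The Python snippet below is only an illustration under stated assumptions: the dictionary WEB, the URI ex:u and the functions dereference and outgoing_links are hypothetical names that stand in for real HTTP lookups.

```python
from typing import Dict, Set, Tuple

Triple = Tuple[str, str, str]

# Toy stand-in for the WLOD: each URI identifies a datasource holding RDF triples.
# "ex:u" is a hypothetical third-party datasource, as in the Rome/Italy example.
WEB: Dict[str, Set[Triple]] = {
    "ex:u": {("dbpedia:Rome", "dbpo:country", "dbpedia:Italy")},
    "dbpedia:Robert_Johnson": {
        ("dbpedia:Robert_Johnson", "dbpo:deathPlace", "dbpedia:Greenwood"),
        ("dbpedia:Robert_Johnson", "dbpo:genre", "dbpedia:Delta_blues"),
    },
}


def dereference(uri: str) -> Set[Triple]:
    """Return the triples of the datasource at `uri` (empty if unknown)."""
    return WEB.get(uri, set())


def outgoing_links(uri: str) -> Set[Tuple[str, str]]:
    """Labeled links (predicate, target) leaving `uri`.

    They are discovered only by dereferencing `uri` itself, so the rest
    of the graph stays unknown to the agent until it is visited.
    """
    return {(p, o) for (s, p, o) in dereference(uri) if s == uri}
```

In this sketch an agent never sees WEB as a whole; it only learns about links one dereferenced URI at a time, which is the situation a navigational language has to cope with.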
Fig. 1 (a) shows an excerpt of the Web of documents with some HTML pages and (syntactic) links between them. For instance, the page about the R. Johnson blues foundation links to the (official) Web page about E. Clapton. Note that there is no link between the page about R. Johnson and the page about the city of Greenwood, where he passed away. Fig. 1 (b) shows information about the same entities reported in Fig. 1 (a), taken from DBpedia, the counterpart of Wikipedia in the WLOD. Here, note the availability of structured information in RDF, such as the triple that links R. Johnson to the city of Greenwood. The label of this link, that is, dbpo:deathPlace, carries a semantic meaning because it is formally defined in the DBpedia ontology⁶.
[Figure 1: (a) an excerpt of the Web of documents: the HTML pages http://www.ericclapton.com/, http://www.jimihendrix.com, http://www.forever27.co.uk/, http://www.greenwoodms.com/ and a page about R. Johnson, connected by unlabeled links; (b) the same entities in DBpedia: the nodes dbpedia:Robert_Johnson, dbpedia:Eric_Clapton, dbpedia:Jimi_Hendrix, dbpedia:Greenwood and dbpedia:27_Club, connected by edges labeled dbpo:deathPlace, dbpo:influenced, dbpo:belongsTo and dbpo:member. The RDF description of dbpedia:Robert_Johnson includes the triples <dbpedia:Robert_Johnson,dbpo:deathPlace,dbpedia:Greenwood>, <dbpedia:Robert_Johnson,dbpo:genre,dbpedia:Delta_blues>, <dbpedia:Robert_Johnson,rdf:type,foaf:Person> and <dbpedia:Robert_Johnson,dbpo:writerOf,dbpedia:California_blues>.]
Fig. 1. Web of documents versus Web of Linked Data (WLOD).
⁶ http://wiki.dbpedia.org/Ontology
1.2. Semantic navigation on the Web of Data
The availability of structured data in the form of graphs calls for adequate languages that go beyond keyword-based mechanisms. The traditional tool to get information or knowledge from structured data is a query language. In the world of RDF, there is a standard query language called SPARQL [Harris and Seaborne 2013]. Querying in SPARQL implies access to one or more datasources to satisfy an information need. However, the language offers limited support for the (dynamic) “exploration” and “discovery” of datasources that characterize all the emerging graph applications. One fundamental ingredient toward this goal is support for graph navigation functionalities. Generally speaking, navigation is the process of going, guided by some driving directions, from the known to the unknown in a given space. In the case of RDF graphs, the space is set by the graph topology (encoded by RDF triples) and navigation occurs by traversing edges. Consider, for instance, the discovery of knowledge about musicians influenced by R. Johnson in Fig. 1 (b); starting from the node R. Johnson in DBpedia (the known), it is possible to navigate toward the nodes of other musicians (the unknown) by traversing edges labeled dbpo:influenced (the driving directions). SPARQL (via property paths [Harris and Seaborne 2013]) and other extensions (e.g., nSPARQL [Pérez et al. 2010]) offer graph navigational functionalities whose scope is restricted to local RDF graphs; they offer little support for navigation in the WLOD graph.
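The notion of navigation over a local RDF graph can be sketched in a few lines. The following Python snippet is a minimal illustration, not part of any SPARQL engine: the set GRAPH, its edges and the function step are assumptions made up for the example, loosely inspired by Fig. 1 (b).

```python
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]

# Toy local RDF graph; the specific edges are illustrative only.
GRAPH: Set[Triple] = {
    ("dbpedia:Robert_Johnson", "dbpo:influenced", "dbpedia:Eric_Clapton"),
    ("dbpedia:Robert_Johnson", "dbpo:influenced", "dbpedia:Jimi_Hendrix"),
    ("dbpedia:Robert_Johnson", "dbpo:deathPlace", "dbpedia:Greenwood"),
}


def step(graph: Set[Triple], nodes: Iterable[str], label: str) -> Set[str]:
    """Follow every edge labeled `label` that leaves one of `nodes`."""
    sources = set(nodes)
    return {o for (s, p, o) in graph if s in sources and p == label}


# From the known (R. Johnson) to the unknown (musicians he influenced),
# using the edge label as the driving direction.
influenced = step(GRAPH, {"dbpedia:Robert_Johnson"}, "dbpo:influenced")
```

Here the whole graph is available locally, which is exactly the assumption that no longer holds on the WLOD, where triples only become visible after dereferencing a URI.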
In the WLOD, navigation can go beyond classical crawling, which consists of traversing all the edges toward other nodes; it can be driven by some high-level semantic specification that encodes a reachability test, that is, checking whether from a given node there exists a path (defined by considering edge labels) toward other nodes. In graph query languages, the specification is usually given by using regular expressions over the alphabet of edge labels. Reachability, as well as other approaches to query information from graphs, has been widely studied in graph data management [Angles and Gutierrez 2008; Wood 2012]; however, no proposal has been formalized for distributed graphs such as the WLOD. In this setting, neither querying nor navigation alone is enough; navigation has to be complemented with querying. Navigation is necessary since the topology of the space of datasources to be queried is (from a practical point of view) unbounded, not known in its entirety, and dynamic. Querying is important since each node in the WLOD is an RDF datasource that can be queried to filter and drive the subsequent steps of the navigation. The idea developed in this paper is to have a navigational language that makes use of querying to both (i) drive the navigation by filtering datasources according to the pieces of information they store and (ii) dynamically retrieve data from datasources encountered during the navigation and use such data to perform some basic actions (e.g., send notification messages). The navigation model proposed in this paper has the unique feature of enabling the retrieval of Web fragments, that is, graphs including nodes and edges visited when evaluating an expression. This goes beyond the classical navigational model focused on retrieving nodes. These considerations are at the basis of our proposal for a navigational language for the WLOD.
1.3. The NAUTILOD language: an overview
We see a navigational language for the WLOD graph as a way of providing instructions in the form of navigational expressions. Such expressions make it possible to navigate from a given node toward nodes of interest by adjusting the navigation according to dynamically discovered edges and nodes. Our desideratum is to define a simple language that can help developers and users in performing at least the following basic tasks in a declarative and integrated manner: (i) specification of driving directions, that is, semantic descriptions of routes that allow traversing and retrieving nodes and fragments of the WLOD; (ii) semi-automatic navigation across datasources “driven” by locally found information; (iii) specification of actions to be performed over the data.
[Figure 2: a graph of RDF data spread over the datasources dbpedia.org, nyt.com and freebase.org. Nodes include dbpedia:John_Grisham, dbpedia:Runaway_Jury, dbpedia:John_Cusack, dbpedia:Rachel_Weisz, dbpedia:Jonesboro Arkansas, fb:Runaway Jury, fb:John Cusack, fb:Dustin Hoffman, nyt:N88099498865828113843, nyt:...64373, nyt:...13834 and the corresponding Wikipedia pages; edge labels include dbpo:writer, dbpo:starring, dbpo:birthPlace, dbpo:occupation, dbpo:budget, rdf:type, owl:sameAs, foaf:primaryTopic, foaf:isPrimaryTopicOf, foaf:depiction, fb:genre, fb:starring and skos:prefLabel. Prefixes: dbpedia:<http://dbpedia.org/resource/>, dbpo:<http://dbpedia.org/ontology/>, dbyago:<http://dbpedia.org/class/yago/>, nyt:<http://data.nytimes.com/>, fb:<http://rdf.freebase.com/ns/>, owl:<http://www.w3.org/2002/07/owl/>, rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>, foaf:<http://xmlns.com/foaf/spec/>, skos:<http://www.w3.org/2004/02/skos/core#>.]
Fig. 2. An excerpt of data that can be navigated from dbpedia:John_Grisham.
This paper presents a formal language for the WLOD that we call NAUTILOD (Navigational Language for Linked Open Data). It is based on regular expressions over RDF predicates plus tests using Boolean SPARQL queries performed over RDF datasources. Regular expressions enable the posing of complex information needs involving navigation along the nodes of the WLOD graph, while tests enable the selection and filtering of relevant datasources from which to continue the navigation. NAUTILOD also features a mechanism to specify actions (e.g., send notification messages) that may use data encountered during the navigation. Although there exist several languages to process and query RDF data, the type of dynamic and open high-level specifications featured by NAUTILOD cannot be systematically simulated by these languages (see Related Work). Some approaches enhance SPARQL with navigational features (e.g., [Alkhateeb et al. 2009; Pérez et al. 2010; Harris and Seaborne 2013]) over fixed sets of datasources (typically a single RDF graph) but do not address Web-scale navigation. Crawlers like LDSpider [Isele et al. 2010] offer limited semantic control over the navigation; in other words, it is not possible to select only (a subset of) relevant datasources. NAUTILOD shares some features with approaches that extend the scope of SPARQL queries over the WLOD [Hartig 2011; Umbrich et al. 2014]. There, the traversal of edges is introduced as an extension of the SPARQL semantics, thus limiting explicit control over the navigation. Finally, none of the approaches mentioned above incorporates the declarative specification of actions to be triggered over datasources.
1.4. NAUTILOD by example
The tenet of our proposal is a language to provide high-level specifications to drive the navigation in the WLOD graph toward precise destinations. To show the potential of NAUTILOD we present some examples using the excerpt of real-world data shown in Fig. 2. The formal syntax and semantics are introduced in Section 3. For the sake of space we will use prefixes, instead of full URIs, as defined at http://prefix.cc.
EXAMPLE 1.1. (Aliases via owl:sameAs) Specify and retrieve documents associated to John Grisham in DBpedia and his possible aliases in other datasources.
In this example, the idea is to consider owl:sameAs-paths that originate from Grisham’s URI in DBpedia, the starting point of the navigation in the WLOD graph. The predicate owl:sameAs states that two resources of the WLOD represent the same entity. Recursively, for each URI u reached, check in its corresponding datasource triples of the form (u, owl:sameAs, v); select all v’s found. Finally, for each such v, return all URIs w in triples of the form (v, foaf:primaryTopic, w). The specification can be given via the following NAUTILOD expression:
dbpedia:John_Grisham (owl:sameAs)*/foaf:primaryTopic
In Fig. 2, when evaluating this expression starting from the URI dbpedia:John_Grisham, we get all the different representations of Grisham provided by dbpedia.org and nyt.com. From these nodes, the expression foaf:primaryTopic is evaluated. The result of the evaluation is the set {wiki:John_Grisham, nyt:N88099498865828113843}. Note that the search for documents associated to Grisham, if restricted only to DBpedia, would not have included documents available in the New York Times datasource. The high-level specification given in this expression underlines the main feature of NAUTILOD: the capability to deal with distributed and dynamically discovered datasources.
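The evaluation described in Example 1.1 can be sketched as a closure over owl:sameAs followed by one foaf:primaryTopic step. The Python snippet below is a toy illustration and not the algorithm of the paper (which is given later): the dictionary WEB, the alias URI nyt:grisham_alias and the functions dereference, star and step are hypothetical, and the toy triples only mimic the shape of the data in Fig. 2.

```python
from collections import deque

# Hypothetical toy datasources: URI -> triples obtained by dereferencing it.
WEB = {
    "dbpedia:John_Grisham": {
        ("dbpedia:John_Grisham", "owl:sameAs", "nyt:grisham_alias"),
        ("dbpedia:John_Grisham", "foaf:primaryTopic", "wiki:John_Grisham"),
    },
    "nyt:grisham_alias": {
        ("nyt:grisham_alias", "foaf:primaryTopic", "nyt:N88099498865828113843"),
    },
}


def dereference(uri):
    """Stand-in for an HTTP lookup of `uri`."""
    return WEB.get(uri, set())


def star(seeds, label):
    """Nodes reachable via zero or more `label` edges, dereferencing each URI once."""
    seen, queue = set(seeds), deque(seeds)
    while queue:
        u = queue.popleft()
        for (s, p, o) in dereference(u):
            if s == u and p == label and o not in seen:
                seen.add(o)
                queue.append(o)
    return seen


def step(nodes, label):
    """Nodes reachable via exactly one `label` edge from `nodes`."""
    return {o for u in nodes for (s, p, o) in dereference(u) if s == u and p == label}


# (owl:sameAs)*/foaf:primaryTopic evaluated from dbpedia:John_Grisham
aliases = star({"dbpedia:John_Grisham"}, "owl:sameAs")
documents = step(aliases, "foaf:primaryTopic")
```

Datasources are discovered while the closure is computed: nyt:grisham_alias is only dereferenced because an owl:sameAs edge pointing to it was found in the DBpedia datasource.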
We would like to point out that there is a substantial difference between NAUTILOD and other approaches (e.g., crawlers like LDSpider [Isele et al. 2010]) that could retrieve portions of the WLOD by “crawling” all the data. The difference is that NAUTILOD’s focus is on providing a declarative way to express driving directions that make it possible to locate precisely and selectively specific parts of the WLOD graph. Moreover, NAUTILOD goes beyond the scope of current navigational languages (e.g., SPARQL property paths, nSPARQL [Pérez et al. 2010], PSPARQL [Alkhateeb et al. 2009], RPL [Zauner et al. 2010]); while these languages are meant to be evaluated over local graphs, NAUTILOD targets navigation on the WLOD graph. A more complex example, which extends the scope of current navigational languages with capabilities for specifying actions and cannot be simulated by any of the existing proposals, is:
EXAMPLE 1.2. Specify American actors (and their aliases) that have played in at least one movie based on a book written by J. Grisham. Send by email the Wiki pages of such actors.
This request involves paths of the form dbpo:writer/dbpo:starring and aliases as in the previous example; tests (expressed in NAUTILOD using ASK-SPARQL queries) over the datasource associated to a given URI (if somebody that played in a movie written by J. Grisham is found, check if s/he is an American actress/actor); and actions to be performed using data from the datasource. The specification in NAUTILOD is:
dbpedia:John_Grisham dbpo:writer/dbpo:starring[Test]/ACT[Act]/(owl:sameAs)*
where the test and the action are specified as follows:
Test=ASK {FILTER EXISTS{?ctx rdf:type dbyago:AmericanFilmActors}}
Act=sdEmail("x@y.z","SELECT ?p WHERE {?ctx foaf:isPrimaryTopicOf ?p.}")
The traversal of the dbpo:writer predicate makes it possible to reach movies whose script has been written by J. Grisham; from there, actors of these movies are reached via the predicate dbpo:starring. The (implicit) variable ?ctx in Act is bound to the set of URIs dereferenced at the current step, in this case the actors that played in a movie written by J. Grisham. At this point, the test on the nationality of such actors via the ASK SPARQL query rules out actors that are not American. This filter leaves in this case only dbpedia:John_Cusack. Over the elements of this set (one element in this case), the action will send via email the Wiki page (obtained via the SELECT query). The action sdEmail, implemented by an ad-hoc programming procedure, does not influence the navigation process. Thus, the evaluation will continue from the URI dbpedia:John_Cusack by traversing the edge owl:sameAs (found in the set of triples obtained by dereferencing dbpedia:John_Cusack), as in Example 1.1. The final result of the evaluation is: (i) the set {dbpedia:John_Cusack, fb:John Cusack, nyt:...13834}, that is, the URIs identifying John Cusack in dbpedia.org, freebase.org and nyt.com; (ii) the set of actions performed; in this case one email sent. The previous example underlines two of the main features of NAUTILOD. First, it makes it possible, via tests, to declaratively specify the routes in the WLOD graph that have to be explored. Second, navigation can be complemented with high-level specifications of actions to be performed over data. The combination of these two features is a powerful support for Web developers toward building applications that consume structured data on the WLOD.
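The interplay of navigation, tests and actions in Example 1.2 can be sketched as follows. This Python snippet is an illustration under stated assumptions: WEB is hypothetical toy data shaped after Fig. 2 (with simplified identifiers such as fb:John_Cusack), is_american_actor stands in for the ASK query, and record_email stands in for sdEmail by appending to a list instead of sending mail.

```python
from collections import deque

JG, RJ = "dbpedia:John_Grisham", "dbpedia:Runaway_Jury"
JC, RW = "dbpedia:John_Cusack", "dbpedia:Rachel_Weisz"

# Hypothetical toy datasources: URI -> triples obtained by dereferencing it.
WEB = {
    JG: {(JG, "dbpo:writer", RJ)},
    RJ: {(RJ, "dbpo:starring", JC), (RJ, "dbpo:starring", RW)},
    JC: {(JC, "rdf:type", "dbyago:AmericanFilmActors"),
         (JC, "foaf:isPrimaryTopicOf", "wiki:John_Cusack"),
         (JC, "owl:sameAs", "fb:John_Cusack")},
    RW: {(RW, "rdf:type", "dbyago:EnglishFilmActors")},
}
outbox = []  # records the actions performed instead of sending real emails


def dereference(uri):
    return WEB.get(uri, set())


def step(nodes, label):
    return {o for u in nodes for (s, p, o) in dereference(u) if s == u and p == label}


def star(seeds, label):
    seen, queue = set(seeds), deque(seeds)
    while queue:
        u = queue.popleft()
        for o in step({u}, label) - seen:
            seen.add(o)
            queue.append(o)
    return seen


def is_american_actor(u):
    """Stand-in for Test: an ASK query over the datasource of u."""
    return (u, "rdf:type", "dbyago:AmericanFilmActors") in dereference(u)


def record_email(u):
    """Stand-in for Act: collect the pages a SELECT query would return."""
    outbox.append(("x@y.z", sorted(step({u}, "foaf:isPrimaryTopicOf"))))


actors = step(step({JG}, "dbpo:writer"), "dbpo:starring")
selected = {u for u in actors if is_american_actor(u)}  # the test filters nodes
for u in sorted(selected):
    record_email(u)                                     # the action has no effect on navigation
result = star(selected, "owl:sameAs")                   # navigation continues afterwards
```

The design point the sketch tries to make visible is that the test changes the set of nodes from which navigation continues, while the action only produces a side effect.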
We are now ready to show an example of how with NAUTILOD it is possible not only
to specify a set of nodes conforming to an expression, but also to keep information about
the Web fragment where these nodes are located. Connectivity information available
in Web fragments is crucial in several application scenarios such as social networks
and citation networks. On the Web, connectivity is also useful to track provenance,
that is, reconstructing the path that led to a particular datasource.
EXAMPLE 1.3. (Capturing Web fragments.) Specify the co-author network of Tim Berners-Lee (TBL) by considering co-authorship relations on papers published between 2010 and 2013 only. The specification in NAUTILOD is:
dblp:Tim_Berners-Lee (foaf:maker[Test]/foaf:maker)*
where the test is defined as follows:
Test=ASK {?ctx dc:issued ?y. FILTER(?y>"2010"^^<xsd:gYear>).
FILTER(?y<"2013"^^<xsd:gYear>).}
The purpose of Example 1.3 is to construct a network starting from the URI of TBL in the RDF version of the DBLP bibliography database⁷. Hence, the goal is twofold: (i) identifying nodes (co-authors of TBL) that satisfy the expression; (ii) keeping track of the connections among these nodes. As we will discuss in Section 4, this request goes beyond the scope of current navigational languages (e.g., SPARQL property paths [Harris and Seaborne 2013] and nSPARQL [Pérez et al. 2010]) that mainly focus on identifying nodes.
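The twofold goal above, returning nodes and also the edges that connect them, can be sketched as follows. This Python snippet is only an illustration: the formal fragment semantics is the subject of Section 4, and the choice made here (recording only the edges of completed two-step paths) is an assumption. The toy data in WEB, the ex: URIs and the function names are hypothetical, and the edge direction simply follows the expression of Example 1.3.

```python
# Hypothetical toy datasources; years are plain integers for simplicity.
WEB = {
    "dblp:TBL": {("dblp:TBL", "foaf:maker", "ex:paper1"),
                 ("dblp:TBL", "foaf:maker", "ex:paper2")},
    "ex:paper1": {("ex:paper1", "dc:issued", 2011),
                  ("ex:paper1", "foaf:maker", "ex:alice")},
    "ex:paper2": {("ex:paper2", "dc:issued", 2008),
                  ("ex:paper2", "foaf:maker", "ex:bob")},
    "ex:alice": {("ex:alice", "foaf:maker", "ex:paper3")},
    "ex:paper3": {("ex:paper3", "dc:issued", 2012),
                  ("ex:paper3", "foaf:maker", "ex:carol")},
}


def dereference(uri):
    return WEB.get(uri, set())


def issued_in_range(paper):
    """Stand-in for Test: publication year strictly between 2010 and 2013."""
    return any(p == "dc:issued" and 2010 < y < 2013
               for (s, p, y) in dereference(paper) if s == paper)


def coauthor_fragment(seed):
    """Evaluate (foaf:maker[Test]/foaf:maker)* and keep nodes plus traversed edges."""
    nodes, edges, frontier = {seed}, set(), [seed]
    while frontier:
        author = frontier.pop()
        for (s, p, paper) in dereference(author):
            if s != author or p != "foaf:maker" or not issued_in_range(paper):
                continue
            for (s2, p2, coauthor) in dereference(paper):
                if s2 == paper and p2 == "foaf:maker":
                    edges.add((author, p, paper))
                    edges.add((paper, p2, coauthor))
                    if coauthor not in nodes:
                        nodes.add(coauthor)
                        frontier.append(coauthor)
    return nodes, edges


nodes, edges = coauthor_fragment("dblp:TBL")
```

Returning the pair (nodes, edges) rather than nodes alone is what allows the co-author network to be drawn or the path to a given co-author to be reconstructed.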
In terms of expressiveness, we want to point out that SPARQL property paths cannot express Example 1.3 since they lack capabilities to perform tests à la XPath [Clark and DeRose 1999]; in other words, it is not possible while evaluating a path to check conditions over the nodes (in this case the year of the publication) encountered in the path. An excerpt of the co-author network for Example 1.3 is shown in Fig. 3. To provide a user-friendly visualization, the swget tool implementing the NAUTILOD language features a GUI with a rich set of functionalities to zoom and explore Web fragments and search for specific nodes and edges.
⁷ http://dblp.l3s.de
Fig. 3. The fragment of the WLOD retrieved for Example 1.3.
1.5. Contributions
The following are the main contributions of this paper:
(1) We present a navigational language for the WLOD graph called NAUTILOD. It provides a unique combination of navigational and querying features and incorporates a declarative mechanism to command actions over data.
(2) We define a simple syntax and a formal semantics that associates to a NAUTILOD expression the set of nodes (i.e., URIs) in the WLOD graph that satisfy the expression plus the set of actions performed.
(3) We formalize an enhancement of the semantics of NAUTILOD to capture Web fragments, that is, graphs composed of nodes visited when evaluating expressions along with edges traversed to reach these nodes.
(4) We provide algorithms and study the complexity of evaluating NAUTILOD expressions according to both semantics.
(5) We discuss an implementation of the language in the swget tool, which is readily available on the WLOD. swget is available as an API, a GUI and a Web portal.
(6) We analyze how the different features (i.e., navigation, tests, actions) affect the cost of the evaluation of NAUTILOD expressions.
(7) We compare NAUTILOD (and swget) with the state of the art.
Some of the ideas presented in this paper appeared in the proceedings of the WWW2012 conference [Fionda et al. 2012]. This article considerably expands our previous work in the following respects:
(1) We deepen the motivations underlying the NAUTILOD language (also in the light of recent proposals such as the Google Knowledge Graph and Facebook Graph) and improve its presentation.
(2) We introduce two new formal semantics for the NAUTILOD language that make it possible to capture Web fragments.
(3) We present algorithms for evaluating NAUTILOD expressions according to the new semantics, study their complexity and discuss their implementation.
(4) We perform a detailed evaluation of swget and compare it with related research.
(5) We discuss examples of Web fragments with real data.
1.6. Organization of the paper
Section 2 provides some background. Section 3 introduces the NAUTILOD language; here the syntax and the semantics returning sets of nodes are introduced. Section 4 discusses the new semantics of NAUTILOD capturing Web fragments. Section 5 presents an implementation of NAUTILOD in the swget tool. An experimental evaluation of swget is discussed in Section 6. Section 7 discusses related research. Finally, in Section 8 and Section 9 we draw some conclusions.
2. PRELIMINARIES
This section provides some background on the key notions underlying our proposal.
2.1. RDF and the Linking Open Data project
The Resource Description Framework (RDF) is a metadata model introduced by the W3C for representing information about Web resources. RDF leverages Uniform Resource Identifiers (URIs) to identify resources; URIs represent global identifiers in the Web and make it possible to access the descriptions of resources according to specific protocols (e.g., HTTP). RDF builds upon the notion of statement. A statement defines the property p holding between two resources, the subject s and the object o. A statement is denoted by (s, p, o), and is thus called a triple in RDF. As an example, in Fig. 2 (dbpedia:John_Grisham, dbpo:writer, dbpedia:Runaway_Jury) is an RDF triple. RDF triples make use of vocabularies (ontologies) that model domains of interest. In the previous triple, the predicate dbpo:writer is defined in the DBpedia ontology and expresses the relation between a person and a piece of work. A collection of triples is referred to as an RDF graph. As discussed in Section 1.1, the availability of structured information (in RDF) enables a new (semantic) space where objects are linked and looked up by using (Semantic) Web languages and technologies. This space can be thought of as a giant global graph [Berners-Lee 2006] that we will refer to as the Web of Linked Data (WLOD). In order to publish and interlink data in the WLOD, the Linking Open Data (LOD) project [Berners-Lee 2006] sets some informal principles:
(1) Real-world objects or abstract concepts must be assigned names in the form of URIs.
(2) HTTP URIs have to be used so that people can look them up by using existing
technologies.
(3) When someone looks up a URI, associated information has to be provided in a standard
form (e.g., RDF).
(4) Interconnections among URIs have to be provided via references to other URIs.
A central notion in this context is that of dereferenceable URI, that is, a URI that,
when looked up (via an HTTP GET), provides the representation of the resource it
identifies in a standard data format (e.g., RDF). Data in the WLOD are provided
by publishers that maintain thousands of interlinked datasets containing billions of
facts covering diverse domains, such as general knowledge provided by the New York
Times, DBpedia and Freebase/Google; music provided by the BBC; science provided by
Nature; geospatial information provided by Geonames and OpenStreetMap; or public-sector
information provided by the US and UK governments and European agencies.
Despite this wide availability, there is still scant consumption of this enormous
wealth of structured, freely available information. Our proposal of a navigational language
for the WLOD is meant to help fill this gap by providing developers and
Web users with a mechanism to specify, retrieve and act over structured information
on the Web.
2.2. Data model
This section introduces a minimal abstract model of the WLOD that highlights the
main features required in the subsequent discussion. Let $\mathcal{U}$ be the set of all URIs
and $\mathcal{L}$ the set of literals, that is, strings or special datatypes. We distinguish between
two types of RDF triples: RDF links $(s, p, o) \in \mathcal{U} \times \mathcal{U} \times \mathcal{U}$, which encode connections
among resources in the WLOD, and literal triples $(s, p, o) \in \mathcal{U} \times \mathcal{U} \times \mathcal{L}$, which are used
to state descriptions or features of the resource identified by the subject $s$. Note that
the subject/object of a triple, in the general case, can also be a blank node, which for
most purposes could be thought of as an existential variable. However, here we will
not consider blank nodes (note also that their usage is discouraged [Heath and
Bizer 2011]). The following three notions will be fundamental.
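DEFINITION 2.1 (WEB OF LINKED DATA $\mathcal{T}$). Let $\mathcal{U}$ and $\mathcal{L}$ be as described above
(infinite sets). The Web of Linked Data (over $\mathcal{U}$ and $\mathcal{L}$) is the set of triples $(s, p, o)$ in
$\mathcal{U} \times \mathcal{U} \times (\mathcal{U} \cup \mathcal{L})$. We will denote it by $\mathcal{T}$.
DEFINITION 2.2 (DESCRIPTION FUNCTION $\mathcal{D}$). The description function $\mathcal{D} : \mathcal{U} \to \mathcal{P}(\mathcal{T})$
associates with each URI $u \in \mathcal{U}$ a subset of triples of $\mathcal{T}$. By $\mathcal{D}(u)$, we denote the set of
triples obtained by dereferencing $u$.
DEFINITION 2.3 (WEB OF LINKED DATA INSTANCE $\mathcal{W}$). A WLOD instance is a
pair $\mathcal{W} = \langle \mathcal{U}, \mathcal{D} \rangle$, where $\mathcal{U}$ is the set of all URIs and $\mathcal{D}$ is a description function.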
In a WLOD instance what matters are those $u \in \mathcal{U}$ for which $\mathcal{D}(u) \neq \emptyset$. Note that
the description function $\mathcal{D}$ could return the empty set for some $u$ (e.g., if $u$ is
not dereferenceable). To give a concrete example of how the description function
models the process of dereferencing, consider the URI fb:Runaway_Jury in
freebase.org shown in Fig. 2. We have that $\mathcal{D}$(fb:Runaway_Jury) returns the set of triples
{(fb:Runaway_Jury, fb:starring, fb:John_Cusack), (fb:Runaway_Jury, fb:starring,
fb:Dustin_Hoffman), (fb:Runaway_Jury, fb:genre, "Crime fiction")}.
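The data model above can be illustrated with a minimal Python sketch. It is only an in-memory stand-in under stated assumptions: the dictionary DESCRIPTIONS, the helper names describe, rdf_links and literal_triples, and the convention that literals are written with surrounding double quotes are all hypothetical and are not part of the formal model; a real description function would dereference URIs over HTTP.

```python
from typing import Dict, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical in-memory stand-in for dereferencing: URI -> description.
DESCRIPTIONS: Dict[str, Set[Triple]] = {
    "fb:Runaway_Jury": {
        ("fb:Runaway_Jury", "fb:starring", "fb:John_Cusack"),
        ("fb:Runaway_Jury", "fb:starring", "fb:Dustin_Hoffman"),
        ("fb:Runaway_Jury", "fb:genre", '"Crime fiction"'),
    },
}


def describe(uri: str) -> Set[Triple]:
    """D(u): the triples obtained by dereferencing u (empty if not dereferenceable)."""
    return DESCRIPTIONS.get(uri, set())


def is_literal(term: str) -> bool:
    """Assumption of this sketch: literals are written with surrounding double quotes."""
    return term.startswith('"')


def rdf_links(uri: str) -> Set[Triple]:
    """Triples of D(u) whose object is a URI (connections among resources)."""
    return {t for t in describe(uri) if not is_literal(t[2])}


def literal_triples(uri: str) -> Set[Triple]:
    """Triples of D(u) whose object is a literal (features of the subject)."""
    return {t for t in describe(uri) if is_literal(t[2])}
```

With this encoding, a URI that is missing from DESCRIPTIONS simply has an empty description, which mirrors the case of a non-dereferenceable URI.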
From now on we will denote a WLOD instance simply via WLOD. Fig. 4 provides
a pictorial representation of the Web of documents and the WLOD. The Web is traditionally
modeled as a graph of syntactic links between pages that maintain unstructured
information. The traditional way of accessing information in such a graph
is via crawlers. The WLOD can be represented as a set of nodes plus data describing
their semantic structure attached to each node and their interlinks. Effectively, a node
represents a datasource (set of RDF triples) identified by a URI with links to other
datasources. These new features make it possible to evolve crawling toward more sophisticated
forms of semantic navigation that can be specified via formal languages such as the
NAUTILOD language introduced in this paper.
3. NAUTILOD: SEMANTIC NAVIGATION ON THE WEB OF LINKED DATA
This section presents the NAUTILOD (Navigational Language for Linked Open Data)
language. The first goal of NAUTILOD is to enable the declarative specification of
nodes of the Web of Linked Data (WLOD) graph by leveraging semantic information
available in both nodes and edges. NAUTILOD incorporates two main features. First,
given a high-level conceptual specification of some semantic nodes in the WLOD graph
(e.g., Italian directors influenced by S. Kubrick), it provides a way to drive the navigation
toward the relevant places where to find them. Second, NAUTILOD makes it possible to
command actions over data encountered during the navigation (e.g., send via email the
homepages of all the French directors encountered). Although there have been several
proposals for accessing data in the WLOD (see Section 7), they differ from our proposal
in the following main respects: (i) they treat navigation as a second-class citizen, in the
sense that such proposals feature neither explicit constructs to perform semantic
navigation on the WLOD nor functionalities to retrieve fragments of it; (ii) none
of them incorporates mechanisms to combine navigation and actions over data. The
formalization of NAUTILOD has been inspired by two unrelated proposals, that is,
wget, a tool to automatically navigate and retrieve Web pages, and XPath [Clark and
DeRose 1999], a language to specify parts of a document in the world of semistructured
data.
[Figure: left panel "The Web", pages 1-5 connected by href hyperlinks and accessed via Search/Query; right panel "The Web of Linked Data", nodes uri1-uri4 connected by labeled edges, where dereferencing a URI yields its description, i.e., a set of triples such as (uri1, r, uri4) and (uri2, p, uri1).]
Fig. 4. The Web vs the Web of Linked Data.
3.1. Syntax of NAUTILOD
The syntax of NAUTILOD is defined according to the grammar reported below. The
navigational core of NAUTILOD is based on regular path expressions [Wood 2012],
similarly to Web query languages (e.g., [Mendelzon et al. 1997; Abiteboul and Vianu
1997]) and XPath [Clark and DeRose 1999]. The navigation in the WLOD can be controlled
via existential tests in the form of ASK-SPARQL queries [Harris and Seaborne
2013]. This mechanism makes it possible to redirect the navigation based on the information
present at each node of the navigational path. The language also makes it possible to command
actions during the navigation according to decisions based on the original specification
and data available in the datasources (nodes in the WLOD) visited.
path   ::= pred | pred^ | action | path/path | path* |
           path|path | path[test] | path<l-h>
pred   ::= RDF predicate | <_>
test   ::= ASK-SPARQL query
action ::= ACT[procedure(target, SELECT-SPARQL query)]
NAUTILOD is based on path expressions; it accepts concatenations of basic and complex
types of expressions. Basic expressions are predicates (pred) and actions (action);
complex expressions are concatenations and disjunctions of expressions, expressions
involving repetitions using the features of regular languages [Hopcroft et al. 2000]
(i.e., the Kleene * operator corresponding to zero or more repetitions and the <l-h> operator
meaning at least l and at most h repetitions), and expressions followed by a
test. The building blocks of an expression reflect the classical functionalities of navigational
languages as described in what follows:
(1) Predicates: pred is an RDF predicate or the wildcard <_> denoting any predicate.
(2) Test Expressions: A test is given in the form of an ASK-SPARQL query used to
perform an existential test at a datasource (node) in the WLOD.
NAUTILOD expressions also include the additional feature of actions:
(3) Action Expressions: An action is a procedural specification of a command (e.g.,
send a notification message). Actions neither influence the navigational process nor
change the underlying data.
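The grammar and its building blocks can be mirrored by a small abstract syntax tree. The following Python sketch is one possible encoding, not the representation used by swget; all class names, field names and the show helper are hypothetical, and the parentheses that show adds around repetitions and disjunctions are a choice of this sketch.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Pred:
    name: str              # an RDF predicate, e.g. "rdf:type"
    inverse: bool = False  # True for pred^


@dataclass(frozen=True)
class Wildcard:            # <_>, any predicate
    pass


@dataclass(frozen=True)
class Action:              # ACT[procedure(target, SELECT-SPARQL query)]
    procedure: str
    target: str
    select_query: str


@dataclass(frozen=True)
class Concat:              # path/path
    left: "Path"
    right: "Path"


@dataclass(frozen=True)
class Alt:                 # path|path
    left: "Path"
    right: "Path"


@dataclass(frozen=True)
class Star:                # path*
    path: "Path"


@dataclass(frozen=True)
class Repeat:              # path<l-h>
    path: "Path"
    low: int
    high: int


@dataclass(frozen=True)
class Test:                # path[test]
    path: "Path"
    ask_query: str


Path = Union[Pred, Wildcard, Action, Concat, Alt, Star, Repeat, Test]


def show(p: Path) -> str:
    """Render an expression back into the concrete syntax of Section 3.1."""
    if isinstance(p, Pred):
        return p.name + ("^" if p.inverse else "")
    if isinstance(p, Wildcard):
        return "<_>"
    if isinstance(p, Action):
        return f"ACT[{p.procedure}({p.target}, {p.select_query})]"
    if isinstance(p, Concat):
        return f"{show(p.left)}/{show(p.right)}"
    if isinstance(p, Alt):
        return f"({show(p.left)}|{show(p.right)})"
    if isinstance(p, Star):
        return f"({show(p.path)})*"
    if isinstance(p, Repeat):
        return f"({show(p.path)})<{p.low}-{p.high}>"
    if isinstance(p, Test):
        return f"{show(p.path)}[{p.ask_query}]"
    raise TypeError(f"not a path expression: {p!r}")
```

Keeping the test and the SELECT query as plain strings leaves their evaluation to whatever SPARQL engine is available; the tree only records where in the path they occur.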
Let us elaborate a bit more on the support of actions. NAUTILOD expressions make it possible
to specify certain basic types of actions, more in the line of side effects than updates.
Indeed, in addition to computing the result (set of nodes or Web fragments), an expression
(via actions) can act on the outside world (e.g., users or devices). It is important
to point out that the kinds of actions supported by NAUTILOD are not meant to transform
datasources (e.g., as in [Abiteboul et al. 2002]); their main purpose is to play the
role of meta-communication means that support basic communications (e.g.,
sending emails with some data). Parameter values of an action are obtained from
the datasource (node) reached during the navigation via a SELECT-SPARQL query.
Moreover, the special parameter target is specified (by the user) to indicate the communication
channel (e.g., an email address, a mobile phone number, a printer name).
A scenario where actions are useful is when one is interested not only in retrieving
particular pieces of information (e.g., her co-authors, her co-author network) but also
in receiving additional information that has been "encountered" during the evaluation
and is not part of the result. An example of such additional information is the list of nationalities
of co-authors, which can be obtained when building a co-author network where only the
names of co-authors (and their links) are specified to be part of the network retrieved.
If restricted to (1) and (2), NAUTILOD can be seen as a declarative language to specify
sets of datasources (nodes) or fragments of the WLOD that conform to some semantic
specification. We now introduce the formal semantics of NAUTILOD capturing nodes.
3.2. NAUTILOD semantics returning sets of nodes
NAUTILOD expressions are evaluated against a WLOD instance $\mathcal{W} = \langle \mathcal{U}, \mathcal{D} \rangle$ from a seed
node (i.e., a URI $u$) that represents the starting point of the navigation in the WLOD
graph. The evaluation of an expression produces the set of nodes of the WLOD that are
reachable from the seed via paths satisfying the expression, plus the set of actions triggered
during its evaluation. The fragment of the language without actions follows the
lines of the formalization of XPath by Wadler [Wadler 1999]. The semantics of NAUTILOD
is reported in Table I, where $\mathcal{W}$ is omitted for the sake of conciseness. The semantics has
the following modules:
$\mathit{Sem}\llbracket \mathit{path} \rrbracket(u)$: This function takes as input a path (i.e., an expression defined according
to the syntax in Section 3.1) and a URI $u$. The function gives as output
the ordered pair of two sets: the set of URIs reached by the evaluation of path, and
the set of actions performed during the evaluation. The function $\mathit{Sem}$ is the first to
be called when evaluating a NAUTILOD expression. It uses the function $U$ to obtain
the set of URIs, while the set of actions performed is obtained via the function $\mathit{EA}$.
$U\llbracket \mathit{path} \rrbracket(u)$: This function takes as input a path, which is evaluated starting from
the URI $u$ in the WLOD instance $\mathcal{W}$, and gives as output the set of URIs reachable
via sequences of edges satisfying path. In Table I, rules R2-R10 describe the result
produced by $U$ when considering the different types of path that can be defined
according to the NAUTILOD syntax reported in Section 3.1.
$\mathit{EA}\llbracket \mathit{path} \rrbracket(u)$: This function is responsible for the execution of the actions generated
during the evaluation of a path starting from the URI $u$. In Table I, rule R11
shows that the function $\mathit{EA}$ calls an additional function $\mathit{Exec}(a, \mathcal{D}(u))$. The function
$\mathit{Exec}$ in rule R11 takes two parameters: an action $a$ of the form procedure(target,
SELECT-SPARQL query) and the description of a datasource. $\mathit{Exec}$ proceeds as follows:
it runs the SELECT-SPARQL query over $\mathcal{D}(u)$ and communicates the results obtained
to the target. We again want to emphasize that actions are side effects and do not
update datasources. Note also that rule R11 calls the function $A$ to obtain the pairs
$(u, a)$ to be given as input to $\mathit{Exec}$.
$A\llbracket \mathit{path} \rrbracket(u)$: This function takes as input the different types of path (see Section 3.1)
and gives as output the set of actions associated with the URIs visited during the
evaluation of path starting from the URI $u$. Rules R12-R20 in Table I cover all the
possible cases. Note that not all the types of path produce actions; indeed, rules R12-R14
return the empty set since no action is syntactically specified. On the other
hand, rule R15 covers the only basic type of path that returns an action. For collecting
the actions inside a complex path, rules R16-R20 are used.
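Table I. Semantics of NAUTILOD returning nodes. The rules reflect all the possible forms of syntactic expressions
that can be given according to the NAUTILOD syntax in Section 3.1. The evaluation of an expression occurs by
calling the function $\mathit{Sem}$ (rule R1). It produces two sets: (i) the set of URIs of $\mathcal{W}$ satisfying the expression, obtained
via the function $U$; (ii) the set of performed actions obtained, via the function $\mathit{EA}$, when evaluating the expression.
$\mathit{Exec}(a, \mathcal{D}(u))$ (rule R11) denotes the execution of action $a$ over the datasource identified by $u$.
R1   $\mathit{Sem}\llbracket \mathit{path} \rrbracket(u, \mathcal{W}) = (U\llbracket \mathit{path} \rrbracket(u),\ \mathit{EA}\llbracket \mathit{path} \rrbracket(u))$
R2   $U\llbracket p \rrbracket(u) = \{u' \mid (u, p, u') \in \mathcal{D}(u)\}$
R3   $U\llbracket p\hat{\ } \rrbracket(u) = \{u' \mid (u', p, u) \in \mathcal{D}(u)\}$
R4   $U\llbracket \langle\_\rangle \rrbracket(u) = \{u' \mid \exists p,\ (u, p, u') \in \mathcal{D}(u)\}$
R5   $U\llbracket \mathit{action} \rrbracket(u) = \{u\}$
R6   $U\llbracket \mathit{path}_1/\mathit{path}_2 \rrbracket(u) = \{u'' \in U\llbracket \mathit{path}_2 \rrbracket(u') : u' \in U\llbracket \mathit{path}_1 \rrbracket(u)\}$
R7   $U\llbracket (\mathit{path})^{*} \rrbracket(u) = \{u\} \cup \bigcup_{i=1}^{\infty} U\llbracket \mathit{path}_i \rrbracket(u) : \mathit{path}_1 = \mathit{path} \wedge \mathit{path}_i = \mathit{path}_{i-1}/\mathit{path}$
R8   $U\llbracket (\mathit{path})\langle l\text{-}h \rangle \rrbracket(u) = \{u\} \cup \bigcup_{i=l}^{h} U\llbracket \mathit{path}_i \rrbracket(u) : \mathit{path}_1 = \mathit{path} \wedge \mathit{path}_i = \mathit{path}_{i-1}/\mathit{path}$
R9   $U\llbracket \mathit{path}_1 | \mathit{path}_2 \rrbracket(u) = U\llbracket \mathit{path}_1 \rrbracket(u) \cup U\llbracket \mathit{path}_2 \rrbracket(u)$
R10  $U\llbracket \mathit{path}[\mathit{test}] \rrbracket(u) = \{u' \in U\llbracket \mathit{path} \rrbracket(u) : \mathit{test}(u') = \mathit{true}\}$
R11  $\mathit{EA}\llbracket \mathit{path} \rrbracket(u) = \{\mathit{Exec}(a, \mathcal{D}(u)) : (u, a) \in A\llbracket \mathit{path} \rrbracket(u)\}$
R12  $A\llbracket p \rrbracket(u) = \emptyset$
R13  $A\llbracket p\hat{\ } \rrbracket(u) = \emptyset$
R14  $A\llbracket \langle\_\rangle \rrbracket(u) = \emptyset$
R15  $A\llbracket \mathit{action} \rrbracket(u) = \{(u, \mathit{action})\}$
R16  $A\llbracket \mathit{path}_1/\mathit{path}_2 \rrbracket(u) = A\llbracket \mathit{path}_1 \rrbracket(u) \cup \bigcup_{u' \in U\llbracket \mathit{path}_1 \rrbracket(u)} A\llbracket \mathit{path}_2 \rrbracket(u')$
R17  $A\llbracket (\mathit{path})^{*} \rrbracket(u) = \bigcup_{i=1}^{\infty} A\llbracket \mathit{path}_i \rrbracket(u) : \mathit{path}_1 = \mathit{path} \wedge \mathit{path}_i = \mathit{path}_{i-1}/\mathit{path}$
R18  $A\llbracket (\mathit{path})\langle l\text{-}h \rangle \rrbracket(u) = \bigcup_{i=l}^{h} A\llbracket \mathit{path}_i \rrbracket(u) : \mathit{path}_1 = \mathit{path} \wedge \mathit{path}_i = \mathit{path}_{i-1}/\mathit{path}$
R19  $A\llbracket \mathit{path}_1 | \mathit{path}_2 \rrbracket(u) = A\llbracket \mathit{path}_1 \rrbracket(u) \cup A\llbracket \mathit{path}_2 \rrbracket(u)$
R20  $A\llbracket \mathit{path}[\mathit{test}] \rrbracket(u) = A\llbracket \mathit{path} \rrbracket(u)$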
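The rules of Table I can be read as a recursive evaluator. The Python sketch below is a direct, unoptimized transcription over an in-memory dictionary that stands in for the description function; it is not the automaton-based algorithm of Section 3.4. The tuple encoding of expressions, the helper names levels and closure, and the use of a Python callable in place of an ASK-SPARQL test are assumptions of this sketch. Bounded repetition assumes 1 <= l <= h, and actions are returned as (URI, action) pairs rather than executed.

```python
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]
WLOD = Dict[str, Set[Triple]]  # hypothetical stand-in for D: URI -> description


def D(w: WLOD, u: str) -> Set[Triple]:
    """Description of u; empty when u is not dereferenceable."""
    return w.get(u, set())


def levels(e, u: str, w: WLOD, h: int) -> List[Set[str]]:
    """levels[i] is the node set of i repetitions of e from u; levels[0] = {u}."""
    out = [{u}]
    for _ in range(h):
        out.append({v for x in out[-1] for v in U(e, x, w)})
    return out


def closure(e, u: str, w: WLOD) -> Set[str]:
    """Nodes reached from u by zero or more repetitions of e (finite data assumed)."""
    seen, frontier = {u}, {u}
    while frontier:
        frontier = {v for x in frontier for v in U(e, x, w)} - seen
        seen |= frontier
    return seen


def U(e, u: str, w: WLOD) -> Set[str]:
    kind = e[0]
    if kind == "pred":   # R2
        return {o for (s, p, o) in D(w, u) if s == u and p == e[1]}
    if kind == "inv":    # R3
        return {s for (s, p, o) in D(w, u) if o == u and p == e[1]}
    if kind == "any":    # R4
        return {o for (s, p, o) in D(w, u) if s == u}
    if kind == "act":    # R5
        return {u}
    if kind == "seq":    # R6
        return {v for x in U(e[1], u, w) for v in U(e[2], x, w)}
    if kind == "star":   # R7
        return closure(e[1], u, w)
    if kind == "rep":    # R8, e = ("rep", path, l, h)
        return {u}.union(*levels(e[1], u, w, e[3])[e[2]:e[3] + 1])
    if kind == "alt":    # R9
        return U(e[1], u, w) | U(e[2], u, w)
    if kind == "test":   # R10, e[2] is a callable over a description
        return {v for v in U(e[1], u, w) if e[2](D(w, v))}
    raise ValueError(f"unknown expression kind: {kind}")


def A(e, u: str, w: WLOD) -> Set[Tuple[str, str]]:
    kind = e[0]
    if kind in ("pred", "inv", "any"):   # R12-R14
        return set()
    if kind == "act":    # R15
        return {(u, e[1])}
    if kind == "seq":    # R16
        return A(e[1], u, w).union(*(A(e[2], x, w) for x in U(e[1], u, w)))
    if kind == "star":   # R17: actions of e at every node of the closure
        return set().union(*(A(e[1], x, w) for x in closure(e[1], u, w)))
    if kind == "rep":    # R18: actions of e at repetition levels 0 .. h-1
        lv = levels(e[1], u, w, e[3] - 1)
        return set().union(*(A(e[1], x, w) for level in lv for x in level))
    if kind == "alt":    # R19
        return A(e[1], u, w) | A(e[2], u, w)
    if kind == "test":   # R20
        return A(e[1], u, w)
    raise ValueError(f"unknown expression kind: {kind}")


def sem(e, u: str, w: WLOD):
    """R1, with actions returned as (URI, action) pairs instead of being executed."""
    return U(e, u, w), A(e, u, w)
```

For instance, an expression of the form rdf:type[q]/a would be written as ("seq", ("test", ("pred", "rdf:type"), q), ("act", "a")), with q a Python function over a description. The recursion re-evaluates subexpressions freely, so it is meant for reading the rules, not for efficiency.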
3.3. Examples of expressions returning nodes
To clarify the semantics of NAUTILOD, we now discuss some examples. Consider the
expression consisting of the sole predicate rdf:type evaluated starting from the URI
$u$. The evaluation starts by applying rule R1 in Table I. This rule requires using the
function $U$ (to obtain the set of URIs) and the function $\mathit{EA}$ (to trigger the set of actions).
The function $U$ uses rule R2 since it matches the syntactic expression in input, which
consists of the predicate rdf:type. Rule R2 returns the set of URIs reachable from
$u$ by traversing edges labeled as rdf:type. This is done by inspecting triples of the
form $(u, \text{rdf:type}, u')$ included in $\mathcal{D}(u)$. As for actions, rule R1 calls rule R11, which
in turn calls rule R12; also in this case rule R12 is the only rule that matches the
expression in input (i.e., rdf:type). Note that since no actions have been specified, rule
R12 returns the empty set.
Consider now the evaluation of the expression rdf:type[q], which includes a test q
(an ASK-SPARQL query). The first part is similar to the previous example, that is,
it uses rule R1 and then rules R10 and R2. Rule R2 evaluates rdf:type as discussed
above and returns a set of URIs. This set is filtered by rule R10, which is used because
it matches the syntactic expression in input containing the test [q]. The filtering performed
by using rule R10 discards those URIs $u'$ obtained via rule R2 for which the
ASK query q evaluated on their descriptions $\mathcal{D}(u')$ returns false. Also in this example
the set of actions is empty.
Finally, consider a more complex expression rdf:type[q]/a, which also includes the
specification of an action a. The evaluation starts again by considering rule R1. The
set of URIs that satisfy the expression is obtained by applying the rules R6, R10, R2
and R5. The subexpression rdf:type[q] is evaluated as previously discussed (rules R10
and R2). The evaluation of the action a is done by using rule R5, which simply returns
the whole set of URIs obtained from the evaluation of the previous subexpression. This
behavior points out how actions do not interfere with the navigation.
The evaluation of actions occurs by applying the rules R16, R20, R12 and R15, and their
execution is managed via rule R11. By looking at Table I, it can be noted that before
applying rule R15 the set of actions is the empty set. Then, according to rule R15,
the action a is collected (and then executed via rule R11) for each URI in the evaluation
of rdf:type[q]. Overall, the
evaluation of the expression rdf:type[q]/a consists of: (i) the set of URIs obtained by
evaluating rdf:type[q]; and (ii) the action a performed on each URI $u'$ belonging to the
previous set (possibly using some data from $\mathcal{D}(u')$). We are now ready to introduce an
automaton-based algorithm for the evaluation of NAUTILOD expressions.
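The third example can be unfolded rule by rule. The derivation below is a sketch that follows Table I; the name $N$ for the node set of rdf:type[q] is introduced here only for readability.

```latex
\begin{align*}
N &= U\llbracket \text{rdf:type}[q] \rrbracket(u)
   = \{\, u' \in U\llbracket \text{rdf:type} \rrbracket(u) : q(u') = \mathit{true} \,\}
   && \text{(R10, R2)} \\
U\llbracket \text{rdf:type}[q]/a \rrbracket(u)
  &= \{\, u'' \in U\llbracket a \rrbracket(u') : u' \in N \,\}
   = \bigcup_{u' \in N} \{u'\} = N
   && \text{(R6, R5)} \\
A\llbracket \text{rdf:type}[q]/a \rrbracket(u)
  &= A\llbracket \text{rdf:type}[q] \rrbracket(u) \cup \bigcup_{u' \in N} A\llbracket a \rrbracket(u')
   && \text{(R16)} \\
  &= \emptyset \cup \{\, (u', a) : u' \in N \,\}
   && \text{(R20, R12, R15)} \\
\mathit{EA}\llbracket \text{rdf:type}[q]/a \rrbracket(u)
  &= \{\, \mathit{Exec}(a, \mathcal{D}(u')) : u' \in N \,\}
   && \text{(R11)}
\end{align*}
```

The second line makes explicit why the action leaves the node set unchanged, and the last line shows that the action is run once per node of $N$, over that node's own description.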
3.4. Evaluating NAUTILOD expressions: algorithms and complexity
This section describes an automaton-based algorithm for the evaluation of NAUTILOD
expressions according to the semantics shown in Table I. The usage of automata has a
twofold benefit. First, since NAUTILOD is based on regular expressions, for each NAUTILOD
expression $e$ we can construct the Nondeterministic Finite-state Automaton
(NFA) [Hopcroft et al. 2000] that recognizes strings belonging to the language defined
by $e$. Second, the NFA associated with an expression can be given an intuitive graphical
representation, which simplifies the presentation of the algorithm. Before getting into
the technicalities, we give a high-level overview of the algorithm.
Let $e$ be a NAUTILOD expression and $\mathcal{W}$ a WLOD instance. Given the automaton $A_e$
associated with an expression $e$, the evaluation algorithm builds the product $\mathcal{W} \times A_e$ between
the automaton $A_e$ and the data graph (the WLOD instance $\mathcal{W}$). Note that since
the data graph $\mathcal{W}$ is not available in its entirety, $\mathcal{W} \times A_e$ has to be built incrementally
while traversing $\mathcal{W}$. The result of the evaluation of a NAUTILOD expression $e$ is the
set of nodes in $\mathcal{W} \times A_e$ marked with a final state of $A_e$.
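Algorithm 1: construct(NAUTILOD expression e, URI seed)
Input: NAUTILOD expression $e$, seed URI $\mathit{seed}$
Build: $\mathcal{W} \times A_e = (Q', \Sigma_e \cup \Lambda \cup \Delta, \delta', (\mathit{seed}, q_0), F')$ product automaton
1  compute $e_T$, and let $\Lambda$ and $\Delta$ be the sets of labels used for tests and actions
2  build the NFA $A_e = (Q, \Sigma_e \cup \Lambda \cup \Delta, \delta, q_0, F)$
3  initialize the set of states of $\mathcal{W} \times A_e$ to $\{(\mathit{seed}, q_0)\}$
4  initialize the set of transitions of $\mathcal{W} \times A_e$ to $\emptyset$
5  for each state $(u, q) \in \mathcal{W} \times A_e$ do
6    for each transition $\delta(q, x) \in A_e$ do
7      if $x \in \Sigma_e$ then
8        for each $q' \in \delta(q, x)$ AND $(u, x, u') \in \mathcal{D}(u)$ do
9          add $(u', q')$ to the set of states of $\mathcal{W} \times A_e$
10         add $(u', q')$ to $\delta'((u, q), x)$
11         if $q' \in F$ then
12           add $(u', q')$ to $F'$
13     else if ($x \in \Lambda$ and evalTest($x$, $\mathcal{D}(u)$) = true) or ($x = \varepsilon$) or ($x \in \Delta$) then
14       for each $q' \in \delta(q, x)$ do
15         add $(u, q')$ to the set of states of $\mathcal{W} \times A_e$
16         add $(u, q')$ to $\delta'((u, q), x)$
17         if $q' \in F$ then
18           add $(u, q')$ to $F'$
19 return $\mathcal{W} \times A_e$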
3.4.1. Algorithm: Incremental Product Automaton Construction.
Let $\Sigma_e$ be the set of RDF predicates that appear in $e$ and $|e|$ the size of the expression
being evaluated. We now introduce the notion of core expression. The expression $e_T$ is
the core expression obtained from $e$ by replacing each test with a fresh symbol $\alpha \notin \Sigma_e$
and each action with a fresh symbol $\beta \notin \Sigma_e$. Let $\Lambda$ be the set of symbols used for tests
and $\Delta$ the set of symbols used for actions (the sets $\Sigma_e$, $\Lambda$ and $\Delta$ are pairwise disjoint).
The cost of the translation of an expression $e$ into a core expression $e_T$ is $O(|e|)$ (it can
be done with a scan of the input expression $e$).
The Nondeterministic Finite-state Automaton (NFA) $A_e$ corresponding to $e_T$ is a
tuple $(Q, \Sigma_e \cup \Lambda \cup \Delta, \delta, q_0, F)$, where $Q$ is the set of states of $A_e$, $\delta : Q \times (\Sigma_e \cup \Lambda \cup \Delta \cup \{\varepsilon\}) \to 2^Q$
the transition function, $q_0 \in Q$ the initial state and $F \subseteq Q$ the set of accepting (final)
states. $A_e$ can be built with cost $O(|e_T|) = O(|e|)$ following Thompson's construction
rules [Hopcroft et al. 2000]. We assume that $A_e$ is stored by using two adjacency lists
that make explicit the navigation of transitions and their inverses. This representation
uses space $O(|A_e|) = O(|e|)$. We also assume that given an element $q \in Q$, we can access
its associated list of transitions in time $O(1)$.
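The construction of $A_e$ from a core expression can be sketched as follows. This is a textbook Thompson-style construction restricted to symbols, concatenation, disjunction and Kleene star; the nested-tuple encoding of core expressions, the class name Thompson and the helpers eps_closure and accepts are hypothetical, and only forward adjacency lists are kept (the text assumes inverse lists as well). Tests and actions are assumed to have already been replaced by fresh symbols.

```python
import itertools
from typing import Dict, Iterable, List, Optional, Set, Tuple

EPS: Optional[str] = None  # label used for epsilon transitions
Delta = Dict[int, List[Tuple[Optional[str], int]]]  # state -> [(label, next state)]


class Thompson:
    """Builds an NFA whose size is linear in the size of the core expression."""

    def __init__(self) -> None:
        self._ids = itertools.count()
        self.delta: Delta = {}

    def _new(self) -> int:
        q = next(self._ids)
        self.delta[q] = []
        return q

    def build(self, e) -> Tuple[int, int]:
        """Return (start, accept) for the core expression e."""
        kind = e[0]
        if kind == "sym":      # a predicate, a test symbol or an action symbol
            s, f = self._new(), self._new()
            self.delta[s].append((e[1], f))
        elif kind == "seq":
            s, mid_out = self.build(e[1])
            mid_in, f = self.build(e[2])
            self.delta[mid_out].append((EPS, mid_in))
        elif kind == "alt":
            s, f = self._new(), self._new()
            for sub in (e[1], e[2]):
                a, b = self.build(sub)
                self.delta[s].append((EPS, a))
                self.delta[b].append((EPS, f))
        elif kind == "star":
            s, f = self._new(), self._new()
            a, b = self.build(e[1])
            self.delta[s] += [(EPS, a), (EPS, f)]
            self.delta[b] += [(EPS, a), (EPS, f)]
        else:
            raise ValueError(f"unknown expression kind: {kind}")
        return s, f


def eps_closure(delta: Delta, states: Iterable[int]) -> Set[int]:
    """All states reachable from the given ones via epsilon transitions only."""
    seen = set(states)
    stack = list(seen)
    while stack:
        q = stack.pop()
        for label, dst in delta[q]:
            if label is EPS and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen


def accepts(delta: Delta, start: int, accept: int, word: List[str]) -> bool:
    """Check whether the NFA recognizes the given sequence of symbols."""
    current = eps_closure(delta, {start})
    for sym in word:
        moved = {dst for q in current for label, dst in delta[q] if label == sym}
        current = eps_closure(delta, moved)
    return accept in current
```

Each operator adds at most two states and four transitions, which is where the linear size bound comes from; the accepts helper is included only to check the construction on small inputs.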
Given a WLOD instance $\mathcal{W}$, we indicate with $\mathcal{W}_e$ the portion of $\mathcal{W}$ navigated when
evaluating the expression $e$. We denote by $\mathit{uris}(\mathcal{W}_e)$ the set of URIs in $\mathcal{W}_e$ and by
$\mathit{triples}(\mathcal{W}_e)$ its set of triples. Moreover, $\mathit{seed}$ is the node in $\mathcal{W}$ (i.e., a URI) where
the navigation, for the evaluation of $e$, starts. The product automaton $\mathcal{W} \times A_e$ is built
according to Algorithm 1. In what follows we clarify the phases of the algorithm.
(1) Initialization: $\mathcal{W} \times A_e = (\{(\mathit{seed}, q_0)\}, \Sigma_e \cup \Lambda \cup \Delta, \emptyset, (\mathit{seed}, q_0), \emptyset)$; the product automaton
at the beginning only contains the initial state $(\mathit{seed}, q_0)$ and its transition
function is empty (lines 3-4).
(2) For each state $(u, q)$ of $\mathcal{W} \times A_e$ (line 5), all the transitions originating from $q$ in $A_e$
are considered (line 6). Each state of $\mathcal{W} \times A_e$ is considered exactly once.
(3) The state $(u', q')$ and the transition $((u, q), p, (u', q'))$ are added to $\mathcal{W} \times A_e$ if and only
if $q' \in \delta(q, p)$, $p \in \Sigma_e$ and the triple $(u, p, u') \in \mathcal{D}(u)$, where $\mathcal{D}(u)$ is the description of
$u$ obtained by dereferencing $u$ (lines 8-12).
(4) The state $(u, q')$ and the transition $((u, q), \varepsilon, (u, q'))$ are added to $\mathcal{W} \times A_e$ if and only
if $q' \in \delta(q, \varepsilon)$ (lines 14-18).
(5) The state $(u, q')$ and the transition $((u, q), \mathit{test}, (u, q'))$ are added to $\mathcal{W} \times A_e$ if and
only if $q' \in \delta(q, \mathit{test})$, $\mathit{test} \in \Lambda$ and $\mathit{test}$ evaluates to true over $\mathcal{D}(u)$ (lines 14-18).
(6) The state $(u, q')$ and the transition $((u, q), \mathit{action}, (u, q'))$ are added to $\mathcal{W} \times A_e$ if and
only if $q' \in \delta(q, \mathit{action})$ and $\mathit{action} \in \Delta$ (lines 14-18).
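The phases above can be sketched with a worklist in Python. This is an illustrative reading of Algorithm 1 under assumptions, not the swget implementation: the NFA is given as a dictionary of outgoing (label, state) pairs, labels are tagged tuples, tests are Python callables over a description, and deref is a caller-supplied function standing in for dereferencing. The sample data mirrors the excerpt described for Fig. 5, with the NYT identifiers kept in the abbreviated form used in the text.

```python
from collections import deque
from typing import Callable, Dict, List, Set, Tuple

Triple = Tuple[str, str, str]
Label = Tuple  # ("pred", p) | ("test", callable) | ("act", name) | ("eps",)
NFA = Dict[str, List[Tuple[Label, str]]]  # state -> outgoing (label, next state)
State = Tuple[str, str]                   # (URI, NFA state)


def construct(nfa: NFA, q0: str, finals: Set[str], seed: str,
              deref: Callable[[str], Set[Triple]]):
    """Incrementally build the product automaton starting from (seed, q0)."""
    start = (seed, q0)
    states: Set[State] = {start}                  # phase (1): only the initial state
    delta: List[Tuple[State, Label, State]] = []  # phase (1): no transitions yet
    final_states: Set[State] = set()              # F' starts empty, as in Algorithm 1
    cache: Dict[str, Set[Triple]] = {}            # each URI is dereferenced once
    queue = deque([start])
    while queue:                                  # phase (2): each state handled once
        u, q = queue.popleft()
        if u not in cache:
            cache[u] = deref(u)
        desc = cache[u]
        for label, q2 in nfa.get(q, []):
            kind = label[0]
            if kind == "pred":                    # phase (3): follow matching triples
                targets = [(o, q2) for (s, p, o) in desc if s == u and p == label[1]]
            elif kind == "test":                  # phase (5): stay on u if test holds
                targets = [(u, q2)] if label[1](desc) else []
            else:                                 # phases (4) and (6): eps or action
                targets = [(u, q2)]
            for target in targets:
                delta.append(((u, q), label, target))
                if target not in states:
                    states.add(target)
                    queue.append(target)
                    if q2 in finals:
                        final_states.add(target)
    return states, delta, final_states


# Sample data in the spirit of Fig. 5 (identifiers abbreviated as in the text).
DATA: Dict[str, Set[Triple]] = {
    "dbpedia:John_Grisham": {
        ("dbpedia:John_Grisham", "foaf:primaryTopic", "wiki:John_Grisham"),
        ("dbpedia:John_Grisham", "owl:sameAs", "nyt:...64673"),
    },
    "nyt:...64673": {("nyt:...64673", "foaf:primaryTopic", "nyt:...13843")},
}
EXAMPLE_NFA: NFA = {
    "q0": [(("eps",), "q1")],
    "q1": [(("pred", "owl:sameAs"), "q1"), (("pred", "foaf:primaryTopic"), "qf")],
}
```

A call such as construct(EXAMPLE_NFA, "q0", {"qf"}, "dbpedia:John_Grisham", lambda u: DATA.get(u, set())) explores only the URIs the automaton can reach, which is the point of building the product incrementally.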
EXAMPLE 3.1. (Incremental Product automaton). Fig. 5 (a) shows an excerpt
of real-world data ($\mathcal{W}$). Fig. 5 (b) shows an expression $e$ along with the associated NFA
$A_e$. Figs. 5 (c)-(f) describe the steps necessary to build the product automaton $\mathcal{W} \times A_e$.
Fig. 5 (c) shows the product automaton after the initialization (Algorithm 1, lines
3-4). Fig. 5 (d) shows the partial automaton after the first iteration of the for loop (lines
5-18), when the state (dbpedia:John_Grisham, $q_0$) is considered. In particular, note that
the only transition in $A_e$ starting from $q_0$ is labeled by $\varepsilon$ and leads to the state $q_1$.
This corresponds to adding to $\mathcal{W} \times A_e$ the state (dbpedia:John_Grisham, $q_1$) and one
$\varepsilon$-transition. When the state (dbpedia:John_Grisham, $q_1$) is considered, the two transitions
in $A_e$ from $q_1$ to $q_f$ labeled with foaf:primaryTopic and from $q_1$ to itself labeled by
owl:sameAs make it possible to add the two states (wiki:John_Grisham, $q_f$) and (nyt:...64673, $q_1$)
and the respective transitions. The partial result is shown in Fig. 5 (e). When
(nyt:...64673, $q_1$) is considered, the transition labeled with foaf:primaryTopic makes it possible
to add the state (nyt:...13843, $q_f$), while the transition labeled with owl:sameAs does
not have any effect since there are no outgoing owl:sameAs-edges originating from
nyt:...64673 in $\mathcal{W}$. The result is reported in Fig. 5 (f). Note that when the two states
(wiki:John_Grisham, $q_f$) and (nyt:...13843, $q_f$) are considered, $\mathcal{W} \times A_e$ remains unchanged
since the state $q_f$ has no outgoing transitions in $A_e$.
[Figure: (a) an excerpt of $\mathcal{W}$ spanning dbpedia.org and nyt.com, with edges labeled foaf:depiction, dbpo:birthPlace, foaf:primaryTopic and owl:sameAs; (b) the NFA $A_e$, with states $q_0$, $q_1$ and $q_f$, for e = (owl:sameAs)*/foaf:primaryTopic; (c)-(f) the product automaton $\mathcal{W} \times A_e$ after each construction step.]
Fig. 5. (a) A WLOD instance $\mathcal{W}$; (b) The automaton $A_e$ for the expression in Example 1.1; (c)-(f)
steps necessary for the construction of the product automaton $\mathcal{W} \times A_e$ starting from the seed node
dbpedia:John_Grisham.
THEOREM 3.2. The product automaton $\mathcal{W} \times A_e$ can be built in time $O(|\mathcal{W}_e| \times |e|) + C_T$
using Algorithm 1, where $C_T$ is the cost of the evaluation of tests.
PROOF. At the end of the execution of Algorithm 1, we have that $Q' \subseteq \mathit{uris}(\mathcal{W}_e) \times Q$.
Since in the construction of the product automaton each state of $\mathcal{W} \times A_e$
is considered only once, the first for loop (line 5) costs $O(|Q'|) = O(|\mathit{uris}(\mathcal{W}_e)| \times |Q|)$. For
each state $(u, q)$ of $\mathcal{W} \times A_e$ the outgoing transitions of $q$ in $A_e$ are considered, so that the
inner for loop (line 6) is executed a different number of times for each state. In particular,
each transition $(q, x, q')$ is examined only when considering the states $(\cdot, q) \in Q'$.
Since the same state of $Q$ can be examined at most once for each URI in $\mathcal{W}_e$, each
transition $(q, x, q')$ is examined at most $|\mathit{uris}(\mathcal{W}_e)|$ times. Hence, the code inside the
two for loops (lines 7-18) is executed at most $O(|\mathit{uris}(\mathcal{W}_e)| \times |\delta|)$ times. Inside the for
loops, all the states $q'$ reached by the current transition are considered. When considering
the transitions altogether, the entire automaton $A_e$ is visited. Moreover, triples
belonging to the description of the current URI $u$ are used to compute the states and
transitions to be added to the product automaton $\mathcal{W} \times A_e$. This can be done in time
$O(|\mathcal{D}(u)|)$ (the first time a URI is encountered, triples in $\mathcal{D}(u)$ are stored so that every
subsequent check for a specific predicate can be done in constant time). When considering
the URIs of $\mathcal{W}_e$ altogether, this step costs $O(|\mathit{triples}(\mathcal{W}_e)|)$. Thus, the total cost of
the algorithm is $O((|\mathit{uris}(\mathcal{W}_e)| + |\mathit{triples}(\mathcal{W}_e)|) \times |A_e|) = O(|\mathcal{W}_e| \times |A_e|) = O(|\mathcal{W}_e| \times |e|)$.
As for the transitions in $\delta$ labeled with a symbol $\alpha \in \Lambda$ (i.e., the tests), the above
complexity bound has to be refined by considering the cost $C_T$ of evaluating tests. Each
test is considered at most once for each URI. Note that at most a number of transitions
equal to $|\mathit{triples}(\mathcal{W}_e)| \times |Q|$ is added to $\mathcal{W} \times A_e$ when dealing
with predicates (lines 9-10) and at most an additional $|\mathit{uris}(\mathcal{W}_e)| \times |A_e|$ when dealing with
tests, $\varepsilon$-transitions or actions (lines 15-16). Thus, the maximum number of transitions
in the product automaton is $|\delta'| = O(|\mathcal{W}_e| \times |A_e|) = O(|\mathcal{W}_e| \times |e|)$.
The complexity of the algorithm is parametric in $C_T$, that is, the cost of evaluating
tests. NAUTILOD tests are ASK-SPARQL queries; hence, their cost can be controlled
by choosing a particular fragment of SPARQL [Pérez et al. 2009]. However, note that
the usage of tests makes it possible to (possibly) reduce the size of the set of nodes visited during
the evaluation. After having constructed $\mathcal{W} \times A_e$, the results of the evaluation
of the NAUTILOD expression can be obtained by running Algorithm 2, which checks
the final states. The final result is given in terms of (i) the set of URIs reachable by
paths satisfying the expression and (ii) the set of actions that have been executed. The
cost $C_A$ of executing actions has to be added to the cost of building $\mathcal{W} \times A_e$ via Algorithm 1
and checking final states via Algorithm 2. Finally, note that $\mathcal{W} \times A_e$ contains
$O(|\mathit{uris}(\mathcal{W}_e)| \times |A_e|) = O(|\mathit{uris}(\mathcal{W}_e)| \times |e|)$ transitions labeled with $\beta$ (i.e., actions).
EXAMPLE 3.3. (Computation of result URIs). By considering the product automaton
$\mathcal{W} \times A_e$ in Fig. 5 (f), the results are wiki:John_Grisham and nyt:...13843,
that is, the URIs of the WLOD reported in Fig. 5 (a) reached at some final state of
$\mathcal{W} \times A_e$. These results were discussed in Example 1.1.
The algorithms introduced above make it possible to identify (and return) the sets of nodes
in the WLOD that conform to a specification given in the NAUTILOD language and to
execute the set of actions specified. Note that the same results could have also been
computed without using Algorithm 2; in this case, results can be collected as soon as a
final state is added to $\mathcal{W} \times A_e$ (see Algorithm 1). Moreover, actions can be handled in
different ways: either by collecting the pairs ($u$, action) when the corresponding state
is added to $\mathcal{W} \times A_e$ and executing the whole set as the last step, or by executing each action
($u$, action) as soon as the corresponding state is added to $\mathcal{W} \times A_e$. In the next section
we extend the scope of the semantics of NAUTILOD to capture Web fragments.
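Algorithm 2: evaluate(NAUTILOD expression e, URI seed)
Input: NAUTILOD expression $e$, seed URI $\mathit{seed}$
Build: $R$ set of URIs
Uses: $\mathcal{W} \times A_e = (Q', \Sigma_e \cup \Lambda \cup \Delta, \delta', (\mathit{seed}, q_0), F')$
1  $\mathcal{W} \times A_e$ = construct($e$, $\mathit{seed}$) /* see Algorithm 1 */
2  for each state $(u, q) \in F'$ do
3    add $u$ to $R$
4  for each transition $\delta'((u, q), \mathit{action})$ such that $\mathit{action} \in \Delta$ do
5    execute $\mathit{action}$ over $\mathcal{D}(u)$
6  return $R$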
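The result-collection step can be sketched in Python as a pass over an already built product automaton. The representation is an assumption of this sketch: transitions are triples (source state, label, target state) with tagged-tuple labels, handlers maps an action name to a Python callable, and deref stands in for dereferencing. Running each (URI, action) pair only once is a choice made here to match the set-based semantics of Table I.

```python
from typing import Callable, Dict, Iterable, Set, Tuple

Triple = Tuple[str, str, str]
State = Tuple[str, str]  # (URI, NFA state)


def evaluate(final_states: Iterable[State],
             delta: Iterable[Tuple[State, Tuple, State]],
             deref: Callable[[str], Set[Triple]],
             handlers: Dict[str, Callable[[str, Set[Triple]], None]]) -> Set[str]:
    """Collect the result URIs and run the actions found in the product automaton."""
    result = {u for (u, _q) in final_states}          # lines 2-3 of Algorithm 2
    fired: Set[Tuple[str, str]] = set()
    for (u, _q), label, _target in delta:             # lines 4-5 of Algorithm 2
        if label[0] == "act" and (u, label[1]) not in fired:
            fired.add((u, label[1]))                  # run each (URI, action) once
            handlers[label[1]](u, deref(u))           # action over the description of u
    return result
```

A handler receives the URI and its description, so it can extract its parameters from the data (the role played by the SELECT-SPARQL query in the language) before contacting the target.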
4. CAPTURING WEB FRAGMENTS
We have described NAUTILOD as a language that enables the specification of: (i) sets
of nodes in the WLOD graph; (ii) actions to be performed over data encountered during
the navigation. We have presented a formal semantics (see Section 3.2) and algorithms
(see Section 3.4) to evaluate NAUTILOD expressions. The possibility to specify (and
retrieve) sets of nodes misses information about the fragment of the Web navigated to
reach these nodes; in other words, connectivity information is lost. Indeed, navigation
is only referenced in the semantics as the means to get the resulting set of nodes. Such
behavior is common to many navigational languages (e.g., XPath [Clark and DeRose
1999], nSPARQL [Pérez et al. 2010]). However, connectivity information is crucial in
some contexts [Fionda et al. 2014a; Fionda et al. 2014c].
Consider the Web user Nick who wants to collect a fragment of the FOAF social
graph consisting of people he knows directly or indirectly who share his musical
preferences. While it would not be a problem to specify and retrieve this set of
people with NAUTILOD, it is not possible to provide information about the structure
(connections) of the fragment of the FOAF graph where Nick is linked with these people.
Another domain where connectivity is crucial is that of bibliographic/citation networks.
Consider a researcher interested in a new topic who wants to collect relevant
bibliographic references. She can specify papers that cite a given seed paper and are
published at a specific conference. While it is possible to automatically retrieve via
NAUTILOD the set of papers linked with the seed, connectivity information is lost.
In a distributed environment such as the Web, keeping connectivity information is
very useful for provenance tracking [Gil and Groth 2011], that is, understanding
which datasources the pieces of information in a path come from. Motivated by these scenarios,
we formalize two semantics of the NAUTILOD language to capture fragments
of the Web, that is, graphs including the sets of nodes satisfying an expression along
with edges connecting these nodes. From a high-level perspective, the main challenge
that we are going to face is the fact that now the evaluation of a NAUTILOD expression
has to deal with graphs representing Web fragments instead of sets of nodes, thus
requiring the definition of both the structure of such graphs and new operators to
combine them. Although our discussion is centered on NAUTILOD and the WLOD
graph, the following results can also be generalized to other navigational languages.
4.1. Multipointed Graphs (MPGs)
In this section we introduce a particular type of structure called Multipointed Graph
(MPG). Essentially, an MPG is the structure that we use to formally capture a Web
fragment. As discussed later in this section, MPGs have an algebraic structure that
allows composition, manipulation and reuse.
An MPG Γ is a quadruple (V, E, s, T) where (V, E) is a standard directed graph,
s ∈ V is a seed node (starting node) and T ⊆ V is a set of ending nodes (result nodes).
To access the elements of Γ we use the notation Γ.x, where x ∈ {V, E, s, T}. Let R be
the set of nodes (URIs) obtained by evaluating a NAUTILOD expression starting from
the node (URI) seed according to the semantics returning sets of nodes; this semantics
can be captured via the MPG ({seed} ∪ R, ∅, seed, R). The advantage of MPGs is that
they can additionally store the graph visited while evaluating an expression.
The semantics of NAUTILOD described in Section 3.2 deals with sets of nodes; indeed,
the rules in Table I, which cover all the types of syntactic expressions defined by the
NAUTILOD syntax presented in Section 3.1, construct sets and manipulate them with
standard operators such as set membership (e.g., rule R6 in Table I) and union (e.g.,
rule R8 in Table I). To define the semantics of NAUTILOD returning fragments, it is
necessary to deal with MPGs and associate to each syntactic expression in Table 3.1
a set of MPGs. Moreover, when dealing with MPGs (and sets of MPGs) it is necessary
to: (i) define operations over MPGs reflecting the standard operations over sets of
nodes; (ii) devise specific additional functionalities.
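DEFINITION 4.1. (OPERATIONS OVER MPGS). Let Γi = (Vi, Ei, si, Ti), i = 1, 2, be
MPGs and let Γ_∅ = (∅, ∅, ⊥, ∅) denote the empty MPG, where ⊥ is a special symbol
not in the universe of nodes.
(1) Concatenation
\[
\Gamma_1 \circ \Gamma_2 =
\begin{cases}
\Gamma_\emptyset & \text{if } s_2 \notin T_1, \\
(V_1 \cup V_2,\ E_1 \cup E_2,\ s_1,\ T_2) & \text{if } s_2 \in T_1.
\end{cases}
\]
(2) Union
\[
\Gamma_1 \cup \Gamma_2 =
\begin{cases}
(V_1 \cup V_2,\ E_1 \cup E_2,\ s_1,\ T_1 \cup T_2) & \text{if } s_1 = s_2, \\
\Gamma_1 & \text{if } s_1 \neq s_2 \wedge \Gamma_2 = \Gamma_\emptyset, \\
\Gamma_2 & \text{if } s_1 \neq s_2 \wedge \Gamma_1 = \Gamma_\emptyset, \\
\text{not defined} & \text{if } s_1 \neq s_2 \wedge \Gamma_1, \Gamma_2 \neq \Gamma_\emptyset.
\end{cases}
\]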
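The two operators of Definition 4.1 can be illustrated with a minimal Python sketch. The class name `MPG`, the use of `None` for the special seed of the empty MPG, and the helper `node_set_as_mpg` are assumptions made for this illustration only; they are not part of NAUTILOD or swget.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional, Tuple

Edge = Tuple[str, str, str]  # (source, predicate, target)


@dataclass(frozen=True)
class MPG:
    """A multipointed graph (V, E, s, T); s is None only for the empty MPG."""
    V: FrozenSet[str]
    E: FrozenSet[Edge]
    s: Optional[str]
    T: FrozenSet[str]


# The empty MPG; None plays the role of the special symbol outside the node universe.
EMPTY = MPG(frozenset(), frozenset(), None, frozenset())


def concat(g1: MPG, g2: MPG) -> MPG:
    """Concatenation: non-empty only if g2 starts at an ending node of g1."""
    if g2.s not in g1.T:
        return EMPTY
    return MPG(g1.V | g2.V, g1.E | g2.E, g1.s, g2.T)


def union(g1: MPG, g2: MPG) -> MPG:
    """Union: merges MPGs with the same seed; the empty MPG is neutral."""
    if g1.s == g2.s:
        return MPG(g1.V | g2.V, g1.E | g2.E, g1.s, g1.T | g2.T)
    if g2 == EMPTY:
        return g1
    if g1 == EMPTY:
        return g2
    raise ValueError("union is not defined for MPGs with different seeds")


def node_set_as_mpg(seed: str, result: Iterable[str]) -> MPG:
    """Encode a node-set answer R as the MPG ({seed} ∪ R, ∅, seed, R)."""
    nodes = frozenset(result)
    return MPG(nodes | {seed}, frozenset(), seed, nodes)
```

Representing the components as frozen sets makes MPGs hashable, so they can later be collected into sets of MPGs.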
As can be observed, the above operators applied to two MPGs produce another
MPG and need to be extended over sets of MPGs. Definition 4.2 formalizes such an
extension; here, the binary operator op ∈ {∘, ∪} is applied to all pairs Γ1, Γ2 such
that Γ1 belongs to the first set and Γ2 to the second one.
DEFINITION 4.2. (OPERATIONS OVER SETS OF MPGS). Let S1 and S2 be two sets
of MPGs.
(1) For each op ∈ {∘, ∪} we define:
\[ S_1 \ \mathit{op}\ S_2 = \{ \Gamma_1 \ \mathit{op}\ \Gamma_2 \mid \Gamma_1 \in S_1,\ \Gamma_2 \in S_2 \}. \]
(2) (Disjoint union, direct sum)
\[ S_1 \oplus S_2 = \{ \Gamma \mid \Gamma \in S_1 \vee \Gamma \in S_2 \}. \]
The definition of concatenation over sets of MPGs serves the purpose of encoding
concatenation of NAUTILOD expressions. Disjoint union reflects the case of the union
operation defined for the semantics in Table I; the crucial difference is that now the
result is in terms of MPGs instead of nodes. As for the union, it encodes the standard
graph union operation with an additional condition on the starting nodes. The basic
(atomic) elements are singleton sets of MPGs. We will call them MPG atomic
expressions.
PROPOSITION 4.3. The following algebraic identities between MPG atomic
expressions over a universe Un hold:
(1) The binary operation ∪ is commutative, associative and idempotent;
(2) The binary operation ⊕ is commutative, associative and idempotent;
(3) The binary operation ∘ is associative;
(4) The following identities hold: (a ⊕ b) ∘ c = a ∘ c ⊕ b ∘ c and c ∘ (a ⊕ b) = c ∘ a ⊕ c ∘ b;
(a ∪ b) ∘ c = (a ∘ c) ∪ (b ∘ c) (same left side); (a ⊕ b) ∪ c = a ∪ c ⊕ b ∪ c (same left side).
The identities follow easily from the definitions; using them we can get normal forms
for MPG expressions:
PROPOSITION 4.4. Any algebraic expression of MPGs over a fixed universe Un
together with the above operators can be written in the following normal form, which is
crucial to enhance the semantics of NAUTILOD to capture Web fragments:
\[ \bigoplus_{i=1}^{n} \Big( \bigcup_{j=1}^{k_i} \big( e_{i1} \circ \cdots \circ e_{ik_i} \big) \Big) \]
where the e_{ij} are atomic MPG expressions.
PROOF. The proof is a check that the rewriting system obtained by orienting the
identities in (4) from left to right, plus a ∪ a → a and a ⊕ a → a, forms a confluent and
terminating system modulo associativity and commutativity of the operators ∪ and ⊕
[Baader and Nipkow 1999]. The check of the critical pairs is straightforward in this
case.
Examples of operations over MPGs
Fig. 6 shows examples of MPGs and operations over (sets of) MPGs. Recall that the
definition of MPGs along with their operators is motivated by the fact that we want
NAUTILOD expressions to deal with sets of MPGs instead of sets of nodes. The example
of Fig. 6 shows the concatenation between the two sets of MPGs S1 = {Γ1, Γ2} and
S2 = {Γ3, Γ4}. Such an operation, as shown in Table II (rule R7), is necessary to deal
with NAUTILOD expressions involving path concatenation (expressed in the NAUTILOD
syntax with the symbol /). According to Definition 4.2, it is necessary to apply
the operator ∘ to the elements in the two sets pairwise, thus obtaining a total of 4
concatenation operations as reported in Figs. 6 (a)-(d).
Consider the computation of Γ1 ∘ Γ3. Here, Γ1.s = J. Ford, Γ1.T = {S. Kubrick,
M. Scorsese} and Γ3.s = S. Kubrick, Γ3.T = {T. Burton, P. T. Anderson}. According
to Definition 4.1, the result of the operation is not empty if Γ3.s ∈ Γ1.T (which is the
case). The resulting MPG (reported in Fig. 6 (a)) is composed of the union of the
two node and edge sets (i.e., Γ1.V ∪ Γ3.V and Γ1.E ∪ Γ3.E), the node Γ1.s as starting
node and the set Γ3.T as ending nodes. If we now consider the computation of Γ2 ∘ Γ3, by
looking at Definition 4.1 and Fig. 6, it is easy to see that the result is empty since
Γ3.s ∉ Γ2.T. The operations between the remaining pairs can be computed similarly.
The final result (see Definition 4.2) is the set of MPGs {Γ1 ∘ Γ3, Γ1 ∘ Γ4, Γ_∅}.
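The pairwise concatenation of the two sets can be replayed with a short Python sketch. Only Γ1.s, Γ1.T, Γ3.s and Γ3.T are taken from the text; the edges, the whole of Γ2 and Γ4, and all function names are hypothetical and chosen only to be consistent with the stated result.

```python
from itertools import product

EMPTY = (frozenset(), frozenset(), None, frozenset())  # the empty MPG


def mpg(nodes, edges, seed, ending):
    """Build an MPG (V, E, s, T) as a hashable tuple."""
    return (frozenset(nodes), frozenset(edges), seed, frozenset(ending))


def concat(g1, g2):
    """Concatenation of two MPGs (Definition 4.1)."""
    (v1, e1, s1, t1), (v2, e2, s2, _t2) = g1, g2
    return (v1 | v2, e1 | e2, s1, g2[3]) if s2 in t1 else EMPTY


def concat_sets(set1, set2):
    """Pairwise concatenation of two sets of MPGs (Definition 4.2)."""
    return {concat(g1, g2) for g1, g2 in product(set1, set2)}


INF = "influenced"
# Seeds and ending nodes of g1 and g3 follow the text; edges are illustrative.
g1 = mpg({"J. Ford", "S. Kubrick", "M. Scorsese"},
         {("J. Ford", INF, "S. Kubrick"), ("J. Ford", INF, "M. Scorsese")},
         "J. Ford", {"S. Kubrick", "M. Scorsese"})
g3 = mpg({"S. Kubrick", "T. Burton", "P. T. Anderson"},
         {("S. Kubrick", INF, "T. Burton"), ("S. Kubrick", INF, "P. T. Anderson")},
         "S. Kubrick", {"T. Burton", "P. T. Anderson"})
# g2 and g4 are entirely hypothetical.
g2 = mpg({"Q. Tarantino", "W. Allen"}, {("Q. Tarantino", INF, "W. Allen")},
         "Q. Tarantino", {"W. Allen"})
g4 = mpg({"M. Scorsese", "P. T. Anderson"}, {("M. Scorsese", INF, "P. T. Anderson")},
         "M. Scorsese", {"P. T. Anderson"})

result = concat_sets({g1, g2}, {g3, g4})
```

Because the result is a set, the two empty concatenations involving `g2` collapse into a single empty MPG, which is why only three elements remain.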
4.2. NAUTILOD semantics returning Web fragments
We are now ready to introduce the two formal semantics of NAUTILOD that deal with
MPGs and operations over (sets of) MPGs. Abstractly speaking, an MPG captures both
nodes and edges traversed while evaluating an expression. In what follows we
distinguish between the V (VISITED) semantics and the S (SUCCESSFUL) semantics. The
VISITED semantics returns the MPG consisting of all the nodes and edges navigated during
the evaluation of an expression, while the SUCCESSFUL semantics returns the MPG
obtained as the union of all paths that satisfy the expression (i.e., all the successful
paths). Table II shows the two formal semantics.
[Figure 6: the sets of MPGs S1 = {Γ1, Γ2} and S2 = {Γ3, Γ4} over film directors, and panels (a)-(d) with the four pairwise concatenations that make up S1 ∘ S2; the legend distinguishes starting nodes from ending nodes.]
Fig. 6. MPGs and operations over (sets of) MPGs. Edges represent influenced relations from DBpedia.
The evaluation of an expression e over the WLOD instance W according to the new
semantics is performed starting from a seed node u according to the rules R1-R27 in
Table II. The evaluation returns an ordered pair composed of the MPG that captures
the fragment of W (according to either the VISITED or the SUCCESSFUL semantics)
and the set of actions performed during the evaluation. The main modules are the
functions EV (rule R1) and ES (rule R2), which represent the starting points for the
evaluation of a NAUTILOD expression according to the VISITED and the SUCCESSFUL
semantics, respectively. The definition and application of the rules in Table II follow
the same reasoning described in Section 3.2, with the difference that now the functions
EV, V, ES and S deal with (sets of) MPGs. Indeed, the V and S semantics are given by
considering all the types of syntactic expressions that can be defined according to the
NAUTILOD syntax (see Section 3.1).
It is worth pointing out that the evaluation of an expression always returns an MPG.
Indeed, rules R1 and R2, which are the main modules, always perform the union of
all resulting MPGs, which gives an MPG as a result (see Definition 4.1). Note also
that the union is always defined since all resulting MPGs have the same starting
node (i.e., the seed node u). Hence, the final MPG has u as starting node, and its
ending nodes (i.e., the set T) are the nodes reachable by paths that satisfy the
expression. Actions are handled by the two functions EA (execution) and A (collection).
The crucial aspect of the new semantics is that, as we will show in Section 4.4, the cost
of obtaining the MPG is the same as that of the traditional semantics outputting only
the nodes reached by evaluating an expression.
4.3. Examples of NAUTILOD expressions returning Web fragments
We now provide an example of evaluation for the NAUTILOD expression
e = dbpo:associatedBand/dbpo:genre according to both the VISITED (V) and
SUCCESSFUL (S) semantics on the excerpt of WLOD shown in Fig. 7. We consider the node
dbpedia:Eric_Clapton as seed. Moreover, for the sake of space, we adopt the following
URI shorthands (also reported in Fig. 7): dbpedia:Eric_Clapton (EC),
dbpedia:Plastic_Ono_Band (POB), dbpedia:The_Beatles (TB), dbpedia:The_Rolling_Stones
(TRS), dbpedia:Dire_Straits (DS), dbpedia:Rock_music (RM), dbpedia:Blues_rock (BR),
dbpo:associatedBand (AB), dbpo:genre (G).
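Table II. The VISITED (V) and SUCCESSFUL (S) semantics. The rules reflect all the possible forms of syntactic
expressions that can be given according to the NAUTILOD syntax in Section 3.1. The rules for the syntactic
expression path<l-h> are omitted since they are simple adaptations of rules R8, R16 and R25 where the ⋃ or the ⨁
are defined for the interval [l, h] instead of [1, ∞], as already shown for the semantics returning nodes in Table I.
To access the elements of an MPG Γ we use the notation Γ.x where x ∈ {V, E, s, T}.
R1   $\mathit{EV}[\![path]\!](u, \mathcal{W}) = \big( \bigcup_{\Gamma \in V[\![path]\!](u)} \Gamma,\ \mathit{EA}[\![path]\!](u) \big)$
R2   $\mathit{ES}[\![path]\!](u, \mathcal{W}) = \big( \bigcup_{\Gamma \in S[\![path]\!](u) \mid \Gamma.T \neq \emptyset} \Gamma,\ \mathit{EA}[\![path]\!](u) \big)$
R3   $V[\![p]\!](u) = \bigcup_{(u,p,v) \in D(u)} (\{u,v\}, \{(u,p,v)\}, u, \{v\})$
R4   $V[\![p^{\wedge}]\!](u) = \bigcup_{(v,p,u) \in D(u)} (\{u,v\}, \{(v,p,u)\}, u, \{v\})$
R5   $V[\![\langle\_\rangle]\!](u) = \bigcup_{p \in \mathcal{U}} \bigcup_{(u,p,v) \in D(u)} (\{u,v\}, \{(u,p,v)\}, u, \{v\})$
R6   $V[\![act]\!](u) = (\{u\}, \emptyset, u, \{u\})$
R7   $V[\![path_1/path_2]\!](u) = V[\![path_1]\!](u) \circ \bigoplus_{v \in \Gamma.T \mid \Gamma \in V[\![path_1]\!](u)} \big( V[\![path_2]\!](v) \oplus (\{v\}, \emptyset, v, \emptyset) \big)$
R8   $V[\![(path)^{*}]\!](u) = (\{u\}, \emptyset, u, \{u\}) \oplus \bigcup_{i=1}^{\infty} V[\![path_i]\!](u) \mid path_1 = path \wedge path_i = path_{i-1}/path$
R9   $V[\![path_1|path_2]\!](u) = V[\![path_1]\!](u) \oplus V[\![path_2]\!](u)$
R10  $V[\![path[test]]\!](u) = \bigoplus_{\Gamma \in V[\![path]\!](u)} (\Gamma.V, \Gamma.E, \Gamma.s, \{v \in \Gamma.T \mid \mathit{Evaluate}(test, v) = true\})$
R11  $S[\![p]\!](u) = \bigoplus_{(u,p,v) \in D(u)} (\{u,v\}, \{(u,p,v)\}, u, \{v\})$
R12  $S[\![p^{\wedge}]\!](u) = \bigoplus_{(v,p,u) \in D(u)} (\{u,v\}, \{(v,p,u)\}, u, \{v\})$
R13  $S[\![\langle\_\rangle]\!](u) = \bigoplus_{p \in \mathcal{U}} \bigoplus_{(u,p,v) \in D(u)} (\{u,v\}, \{(u,p,v)\}, u, \{v\})$
R14  $S[\![act]\!](u) = (\{u\}, \emptyset, u, \{u\})$
R15  $S[\![path_1/path_2]\!](u) = S[\![path_1]\!](u) \circ \bigoplus_{v \in \Gamma.T \mid \Gamma \in S[\![path_1]\!](u)} S[\![path_2]\!](v)$
R16  $S[\![(path)^{*}]\!](u) = (\{u\}, \emptyset, u, \{u\}) \oplus \bigoplus_{i=1}^{\infty} S[\![path_i]\!](u) \mid path_1 = path \wedge path_i = path_{i-1}/path$
R17  $S[\![path_1|path_2]\!](u) = S[\![path_1]\!](u) \oplus S[\![path_2]\!](u)$
R18  $S[\![path[test]]\!](u) = \bigoplus_{\Gamma \in S[\![path]\!](u)} (\Gamma.V, \Gamma.E, \Gamma.s, \{v \in \Gamma.T \mid \mathit{Evaluate}(test, v) = true\})$
R19  $\mathit{EA}[\![path]\!](u) = \{ \mathit{Exec}(act, D(u)) : (u, act) \in A[\![path]\!](u) \}$
R20  $A[\![p]\!](u) = \emptyset$
R21  $A[\![p^{\wedge}]\!](u) = \emptyset$
R22  $A[\![\langle\_\rangle]\!](u) = \emptyset$
R23  $A[\![act]\!](u) = \{(u, act)\}$
R24  $A[\![path_1/path_2]\!](u) = A[\![path_1]\!](u) \cup \bigcup_{v \in \Gamma.T \mid \Gamma \in (S|V)[\![path_1]\!](u)} A[\![path_2]\!](v)$
R25  $A[\![(path)^{*}]\!](u) = \bigcup_{i=1}^{\infty} A[\![path_i]\!](u) : path_1 = path \wedge path_i = path_{i-1}/path$
R26  $A[\![path_1|path_2]\!](u) = A[\![path_1]\!](u) \cup A[\![path_2]\!](u)$
R27  $A[\![path[test]]\!](u) = A[\![path]\!](u)$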
[Figure 7: graph excerpt around dbpedia:Eric_Clapton, with edges labeled dbpo:associatedBand, dbpo:associatedAct, dbpo:genre and dbpo:musicFusionGenre; the expression evaluated is e = dbpo:associatedBand/dbpo:genre.]
Prefixes: dbpedia: <http://dbpedia.org/resource/>; dbpo: <http://dbpedia.org/ontology/>.
Abbreviations: dbpedia:Eric_Clapton (EC), dbpedia:Plastic_Ono_Band (POB), dbpedia:The_Beatles (TB),
dbpedia:The_Rolling_Stones (TRS), dbpedia:Dire_Straits (DS), dbpedia:Rock_music (RM), dbpedia:The_Who (TW),
dbpedia:Blues_rock (BR), dbpedia:Celtic_rock (CR), dbpo:associatedBand (AB), dbpo:associatedAct (AA),
dbpo:genre (G), dbpo:musicFusionGenre (MFG).
Fig. 7. An excerpt of the WLOD taken from DBpedia.
VISITED semantics. The starting point for the evaluation of e is Rule R1, which
then triggers Rules R7, R19 and R24. Here, it is necessary to evaluate the two parts
of the concatenation both for building the MPG (Rule R7) and for collecting actions (Rule
R24). To build the MPG, the subexpression dbpo:associatedBand (AB) is evaluated by using
Rule R3 starting from Eric Clapton (EC):
V⟦AB⟧(EC) = {({EC, POB, TB, TRS, DS}, {(EC, AB, POB), (EC, AB, TB), (EC, AB, TRS),
    (EC, AB, DS)}, EC, {POB, TB, TRS, DS})}.
Then, from each ending node of each MPG in V⟦AB⟧(EC) (i.e., POB, TB, TRS, DS) the
subexpression dbpo:genre (G) is evaluated by applying again rule R3, thus obtaining:
V⟦G⟧(POB) = Γ_∅
V⟦G⟧(TB) = {({TB, RM}, {(TB, G, RM)}, TB, {RM})}
V⟦G⟧(TRS) = {({TRS, RM, BR}, {(TRS, G, RM), (TRS, G, BR)}, TRS, {RM, BR})}
V⟦G⟧(DS) = Γ_∅.
Then, according to rule R7, the results of V⟦G⟧(POB), V⟦G⟧(TB), V⟦G⟧(TRS) and
V⟦G⟧(DS) are combined through the disjoint union (i.e., the ⊕ operator) with the MPGs
({POB}, ∅, POB, ∅), ({TB}, ∅, TB, ∅), ({TRS}, ∅, TRS, ∅), ({DS}, ∅, DS, ∅); this gives the set:
V_Γ = {({TB, RM}, {(TB, G, RM)}, TB, {RM}),
    ({TRS, RM, BR}, {(TRS, G, RM), (TRS, G, BR)}, TRS, {RM, BR}),
    ({POB}, ∅, POB, ∅), ({TB}, ∅, TB, ∅), ({TRS}, ∅, TRS, ∅), ({DS}, ∅, DS, ∅)}.
Then, the set V⟦AB⟧(EC) is composed (i.e., the ∘ operator) with V_Γ, thus giving:
V⟦AB/G⟧(EC) = {({EC, POB, TB, TRS, DS, RM}, {(EC, AB, POB), (EC, AB, TB),
    (EC, AB, TRS), (EC, AB, DS), (TB, G, RM)}, EC, {RM}),
    ({EC, POB, TB, TRS, DS, RM, BR}, {(EC, AB, POB), (EC, AB, TB),
    (EC, AB, TRS), (EC, AB, DS), (TRS, G, RM), (TRS, G, BR)}, EC, {RM, BR}),
    ({EC, POB, TB, TRS, DS}, {(EC, AB, POB), (EC, AB, TB), (EC, AB, TRS),
    (EC, AB, DS)}, EC, ∅)}.
Finally, according to rule R1, the union (∪ operator) of all MPGs in V⟦AB/G⟧(EC) is
computed, thus returning the following MPG (reported in Fig. 8):
EV⟦e⟧(EC, W) = {({EC, POB, TB, TRS, DS, RM, BR}, {(EC, AB, POB), (EC, AB, TB),
    (EC, AB, TRS), (EC, AB, DS), (TB, G, RM), (TRS, G, RM),
    (TRS, G, BR)}, EC, {RM, BR})}
A similar process is performed to collect and execute actions. In this case, since the
expression does not contain actions, the final set of actions is the empty set.
[Figure 8: the fragment rooted at dbpedia:Eric_Clapton (starting node) containing all dbpo:associatedBand and dbpo:genre edges traversed; ending nodes are dbpedia:Rock_music and dbpedia:Blues_rock.]
Fig. 8. The MPG according to the VISITED semantics.
[Figure 9: the fragment rooted at dbpedia:Eric_Clapton (starting node) restricted to the successful paths through dbpedia:The_Beatles and dbpedia:The_Rolling_Stones; ending nodes are dbpedia:Rock_music and dbpedia:Blues_rock.]
Fig. 9. The MPG according to the SUCCESSFUL semantics.
SUCCESSFUL semantics. The evaluation is similar to the previous case, but the
starting point for the evaluation of e is Rule R2. According to the SUCCESSFUL semantics
the resulting MPG (reported in Fig. 9) is:
ES⟦e⟧(EC, W) = {({EC, TB, TRS, RM, BR}, {(EC, AB, TB), (EC, AB, TRS), (TB, G, RM),
    (TRS, G, RM), (TRS, G, BR)}, EC, {RM, BR})}
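The difference between the two semantics on this example can be reproduced with a small Python sketch. It handles only a plain concatenation of predicates (no tests, actions, inverses or closures), and the function name `evaluate` and the layered evaluation strategy are assumptions of this illustration rather than the algorithms presented in this article.

```python
# Triples of the excerpt that take part in the worked example (shorthands as in the text).
TRIPLES = {
    ("EC", "AB", "POB"), ("EC", "AB", "TB"), ("EC", "AB", "TRS"), ("EC", "AB", "DS"),
    ("TB", "G", "RM"), ("TRS", "G", "RM"), ("TRS", "G", "BR"),
}


def evaluate(triples, seed, predicates):
    """Return (visited_edges, successful_edges, ending_nodes) for p1/p2/.../pk."""
    # layers[i] holds the edges traversed while matching predicates[i].
    layers, frontier = [], {seed}
    for p in predicates:
        step = {(u, q, v) for (u, q, v) in triples if u in frontier and q == p}
        layers.append(step)
        frontier = {v for (_, _, v) in step}
    ending = frontier
    visited = set().union(*layers)
    # Walk the layers backward, keeping only edges that lead to an ending node.
    successful, alive = set(), set(ending)
    for step in reversed(layers):
        kept = {(u, q, v) for (u, q, v) in step if v in alive}
        successful |= kept
        alive = {u for (u, _, _) in kept}
    return visited, successful, ending


visited_edges, successful_edges, ending_nodes = evaluate(TRIPLES, "EC", ["AB", "G"])
```

Here `visited_edges` contains all seven triples, while `successful_edges` drops the two dead-end edges towards POB and DS, matching the edge sets of the two MPGs above.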
4.4. Capturing Web fragments: algorithms and complexity
We are now ready to present algorithms to capture Web fragments according to the
VISITED and SUCCESSFUL semantics. We recall the following notation: W denotes a
WLOD instance; W_e denotes the fragment of W navigated when evaluating an
expression e, while uris(W_e) and triples(W_e) are its sets of URIs and triples,
respectively. Σ_e is the set of RDF predicates that appear in e and |e| is the size of
the expression. e_T is the core expression associated to e, in which tests and actions
are replaced by symbols taken from two dedicated sets (Λ denotes the set of symbols
used in lieu of tests). Moreover, A_e is the NFA, with states Q, transition function δ,
initial state q_0 and final states F, that accepts the language generated by e_T over
the alphabet made of Σ_e and the symbols used in lieu of tests and actions. Now, we
introduce some definitions.
DEFINITION 4.5. (MPG of a Web fragment). Given a WLOD instance W, a
NAUTILOD expression e and a seed node, W_V is the MPG obtained by evaluating e over
W starting from seed according to the VISITED semantics. W_S is the MPG obtained by
evaluating e over W starting from seed according to the SUCCESSFUL semantics.
THEOREM 4.6. The MPG W_V of a NAUTILOD expression e can be computed in time
O(|W_e| × |e|) + C_T, where C_T is the cost of evaluating tests.
PROOF. The MPG corresponding to the VISITED semantics is obtained by executing
Algorithm 3. W_V can be computed by building the product automaton W × A_e, with
states Q′, transition function δ′, initial state q′_0 and final states F′, with cost
O(|W_e| × |A_e|) + C_T (see Theorem 3.2). Starting with an empty MPG, W_V can be built
by visiting W × A_e (lines 4-12). In particular, for each transition ((u, ·), x, (u′, ·))
in δ′ s.t. x ∈ Σ_e, the edge (u, x, u′) and the nodes u and u′ are
added to W_V. It is also necessary to specify the starting node and the set of ending
nodes. In particular, the starting node of W_V is the seed URI (line 3) and the set of
ending nodes is T = {u | (u, q_f) ∈ Q′ ∧ q_f ∈ F} (lines 6-7). Given the product automaton
W × A_e, W_V can be computed by visiting each transition and each node exactly once
with a cost O(|Q′| + |δ′|) = O(|uris(W_e)| × |Q| + |W_e| × |A_e|) = O(|W_e| × |e|). The total
cost of the algorithm (also considering the cost of building the product automaton) is
O(|W_e| × |e|) + C_T. Note that W_V could also be built without any overhead w.r.t. the
construction of W × A_e; this could be done by slightly modifying Algorithm 1 so that
nodes and edges encountered during the navigation are added on-the-fly to W_V.
Algorithm 3: visited(NAUTILOD expression e, URI seed)
Input: NAUTILOD expression e, seed URI seed
Build: W_V = MPG visited during the evaluation of e
1   build the NFA A_e (states Q, transition function δ, initial state q_0, final states F)
2   W × A_e = construct(e, seed)
3   set seed as the starting node of W_V
4   for each state (u, q) ∈ W × A_e do
5       add u to the set of nodes of W_V
6       if q ∈ F then
7           add u to the ending nodes of W_V
8       for each transition δ′((u, q), x) ∈ W × A_e do
9           for each (u′, q′) ∈ δ′((u, q), x) do
10              if x ∈ Σ_e then
11                  add u′ to the set of nodes of W_V
12                  add (u, x, u′) to W_V
13  return W_V
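Algorithm 3 can be sketched in Python as follows. This is an illustration under simplifying assumptions: the WLOD instance is a local set of triples instead of data dereferenced over HTTP, the NFA is a dictionary from (state, predicate) to a set of states, and test and action symbols are left out, so every transition label belongs to Σ_e. The names `build_product` and `visited` are chosen for this sketch.

```python
from collections import deque


def build_product(triples, delta, q0, seed):
    """Reachable part of the product automaton: {(u, q): [(x, (u2, q2)), ...]}."""
    outgoing = {}
    for (u, p, v) in triples:
        outgoing.setdefault(u, []).append((p, v))
    product, queue = {}, deque([(seed, q0)])
    while queue:
        state = queue.popleft()
        if state in product:
            continue  # already expanded
        u, q = state
        product[state] = [(p, (v, q2))
                          for (p, v) in outgoing.get(u, [])
                          for q2 in delta.get((q, p), ())]
        queue.extend(target for _, target in product[state])
    return product


def visited(triples, delta, q0, final, seed):
    """MPG (V, E, s, T) of the VISITED semantics, in the spirit of Algorithm 3."""
    product = build_product(triples, delta, q0, seed)
    nodes, edges, ending = set(), set(), set()
    for (u, q), transitions in product.items():
        nodes.add(u)
        if q in final:
            ending.add(u)
        for x, (u2, _) in transitions:
            nodes.add(u2)
            edges.add((u, x, u2))
    return nodes, edges, seed, ending


TRIPLES = {
    ("EC", "AB", "POB"), ("EC", "AB", "TB"), ("EC", "AB", "TRS"), ("EC", "AB", "DS"),
    ("TB", "G", "RM"), ("TRS", "G", "RM"), ("TRS", "G", "BR"),
}
# NFA for the expression AB/G: q0 --AB--> q1 --G--> q2 (final).
DELTA = {("q0", "AB"): {"q1"}, ("q1", "G"): {"q2"}}
fragment = visited(TRIPLES, DELTA, "q0", {"q2"}, "EC")
```

Each product state and transition is touched once, which mirrors the O(|Q′| + |δ′|) visiting cost used in the proof of Theorem 4.6.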
THEOREM 4.7. W_S can be computed in time O(|W_e| × |e|) + C_T by navigating the
product automaton backward (from the final states to the initial state).
PROOF. W_S can be built by using W × A_e according to Algorithm 4. In particular,
again, the idea is to start with an empty MPG and navigate the product automaton
backward (from the final states to the initial state), adding nodes and edges to W_S
(lines 8, 11 and 13). Each state and each transition (in the opposite direction) is
visited at most once with cost O(|Q′| + |δ′|) = O(|uris(W_e)| × |Q| + |W_e| × |A_e|) =
O(|W_e| × |e|). Thus, the total cost, when also considering the cost of building the
product automaton, is O(|W_e| × |e|) + C_T.
5. AN IMPLEMENTATION OF NAUTILOD
This section presents an implementation of NAUTILOD in the swget tool. swget has
been implemented in Java in different versions: (i) a developer release, which includes
a command-line tool and an API; (ii) an end user release, which features a GUI; (iii)
a Web portal⁸ [Fionda et al. 2014b]. swget uses Jena⁹ to deal with RDF data. Besides
the features presented in Section 3 and Section 4, swget adds a set of ad-hoc options to
further control the navigation from a network-oriented perspective.
8 http://swget.inf.unibz.it
9 http://jena.apache.org
Algorithm 4: successful(NAUTILOD expression e, URI seed)
Input: NAUTILOD expression e, seed URI seed
Build: W_S = MPG of all the paths in W_V that spell e
1   build the NFA A_e; assume q_0 as its initial state and F as the set of final states
2   W × A_e = construct(e, seed)
3   set seed as the starting node of W_S
4   initialize toVisit with all states (u, q_f) ∈ W × A_e such that q_f ∈ F
5   put each u s.t. (u, q_f) ∈ W × A_e, with q_f ∈ F, in the set of ending nodes of W_S
6   while toVisit is not empty do
7       remove from toVisit a pair (u, q)
8       add u to the set of nodes of W_S
9       for each transition δ((u, q), x) ∈ W × A_e do
10          for each (u′, q′) ∈ δ((u, q), x) s.t. (u′, q′) has not been already checked do
11              add u′ to the set of nodes of W_S
12              if x ∈ Σ_e then
13                  add (u, x, u′) to W_S
14              add (u′, q′) to toVisit
15  return W_S
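The backward navigation of Algorithm 4 can be sketched in Python as shown below. The sketch assumes that the product automaton is already available as a dictionary containing only states reachable from the seed, that all labels are predicates in Σ_e, and that edges are stored in their forward orientation. Unlike the pseudocode, it records every backward edge and uses the "already checked" test only to avoid re-enqueueing a state; this is a design choice of the sketch, and the function name `successful` is likewise chosen here.

```python
def successful(product, final, seed):
    """MPG (V, E, s, T) of the SUCCESSFUL semantics, in the spirit of Algorithm 4.

    `product` maps each state (u, q) to a list of (x, (u2, q2)) transitions.
    """
    # Reverse the transitions so that the automaton can be walked backward.
    incoming = {}
    for source, transitions in product.items():
        for x, target in transitions:
            incoming.setdefault(target, []).append((x, source))

    to_visit = [state for state in product if state[1] in final]
    ending = {u for (u, _) in to_visit}
    seen = set(to_visit)
    nodes, edges = set(), set()
    while to_visit:
        u, q = to_visit.pop()
        nodes.add(u)
        for x, (u_prev, q_prev) in incoming.get((u, q), []):
            nodes.add(u_prev)
            edges.add((u_prev, x, u))  # forward orientation of the traversed edge
            if (u_prev, q_prev) not in seen:
                seen.add((u_prev, q_prev))
                to_visit.append((u_prev, q_prev))
    return nodes, edges, seed, ending


# Product automaton of the Eric Clapton example for the expression AB/G (written by hand).
PRODUCT = {
    ("EC", "q0"): [("AB", ("POB", "q1")), ("AB", ("TB", "q1")),
                   ("AB", ("TRS", "q1")), ("AB", ("DS", "q1"))],
    ("POB", "q1"): [],
    ("DS", "q1"): [],
    ("TB", "q1"): [("G", ("RM", "q2"))],
    ("TRS", "q1"): [("G", ("RM", "q2")), ("G", ("BR", "q2"))],
    ("RM", "q2"): [],
    ("BR", "q2"): [],
}
fragment = successful(PRODUCT, {"q2"}, "EC")
```

States such as (POB, q1) and (DS, q1) are never reached when walking backward from the final states, so the corresponding edges do not enter the fragment.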
Such options include, among others, the possibility to limit the size of data
transferred and to enable/disable caching. In particular, swget can be configured to use
a cache that keeps Jena Model objects associated to dereferenced URIs. The cache has the
lifetime of the evaluation of an expression. However, it could also be made persistent,
for instance, via a hashtable-like structure with keys being URIs and values being Jena
Model objects built from data obtained by dereferencing such URIs. This improves the
run-time performance of the system by avoiding dereferencing the same URI multiple
times. However, the cache introduces problems of data freshness (e.g., how often the
cache has to be updated). This issue can be addressed either by setting a cache
lifetime or by looking at the header of HTTP connections when dereferencing a URI and
getting data only if changes are detected with respect to data in the cache. Further
information and examples are available at the swget website¹⁰.
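The cache-lifetime strategy can be illustrated with a minimal Python sketch. It is not the swget implementation (which is written in Java and stores Jena Model objects): the class name `DereferenceCache`, the injected `fetch` function and the injectable clock are all assumptions made so that the example is self-contained.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class DereferenceCache:
    """Maps a URI to the data obtained by dereferencing it, with an optional lifetime."""

    def __init__(self, fetch: Callable[[str], object],
                 lifetime: Optional[float] = None,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch        # performs the actual dereferencing
        self._lifetime = lifetime  # seconds; None means entries never expire
        self._clock = clock
        self._entries: Dict[str, Tuple[float, object]] = {}

    def get(self, uri: str) -> object:
        """Return cached data if still fresh, otherwise dereference the URI again."""
        now = self._clock()
        entry = self._entries.get(uri)
        if entry is not None:
            stored_at, data = entry
            if self._lifetime is None or now - stored_at < self._lifetime:
                return data  # fresh enough: no network access
        data = self._fetch(uri)
        self._entries[uri] = (now, data)
        return data
```

The alternative mentioned above, which inspects HTTP headers and re-downloads only changed data, would replace the lifetime check with a conditional request; it is not shown here.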
5.1. The swget tool
We now provide an overview of the implementation of NAUTILOD in swget. The
conceptual architecture of swget is reported in Fig. 10, which also shows the flow of
information. The user submits a script that contains a NAUTILOD expression plus other
metadata (e.g., parameters, comments in natural language). The script is received by
the Interpreter module, which checks the syntax and initializes both the Execution
Manager (with the seed URI) and the Automaton Builder.
The Automaton Builder, via a JavaCC-based parser¹¹, generates the automaton
associated to the expression; it is used to drive the evaluation of such expression
on the WLOD. The Execution Manager controls the execution and passes the URIs
to be dereferenced (i.e., ids of nodes encountered when navigating the WLOD) to the
Network Manager, which performs the dereferencing of URIs via HTTP GET calls. The
links extracted from the retrieved RDF data (see the Link Extractor in Fig. 10) are
given to the Execution Manager, which starts over the cycle. The execution ends either
when some navigational parameter imposes it (e.g., a threshold on the network traffic
has been reached) or when there are no more URIs to be dereferenced.
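The dereferencing cycle described above can be sketched as a simple loop in Python. This is an illustration only: the automaton-driven selection of links is omitted, `dereference` is a hypothetical stand-in for the Network Manager plus link extraction, and the termination threshold is expressed as a number of dereferenced URIs rather than as network traffic.

```python
from collections import deque
from typing import Callable, Iterable, Optional, Set


def navigate(seed: str,
             dereference: Callable[[str], Iterable[str]],
             max_dereferences: Optional[int] = None) -> Set[str]:
    """Dereference URIs breadth-first until none are left or the budget is spent."""
    pending = deque([seed])
    seen = {seed}
    dereferenced: Set[str] = set()
    while pending:
        if max_dereferences is not None and len(dereferenced) >= max_dereferences:
            break  # a navigational parameter imposes termination
        uri = pending.popleft()
        dereferenced.add(uri)
        for link in dereference(uri):  # links extracted from the retrieved data
            if link not in seen:
                seen.add(link)
                pending.append(link)
    return dereferenced
```

For instance, with a toy link table `web = {"a": ["b", "c"], "b": ["d"]}`, calling `navigate("a", lambda u: web.get(u, []))` dereferences every reachable URI, while adding `max_dereferences=2` stops after the first two.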
10 http://swget.wordpress.com
11 http://javacc.java.net
[Figure 10: diagram of the swget architecture. Users submit a script with swget parameters; the Interpreter feeds the Automaton builder (JavaCC) and the Execution Manager; the Network Manager issues HTTP GET requests on URIs of the Web of Data; the RDF Manager (Jena Model) and the Link Extractor return lists of URIs to the Execution Manager. Outputs are actions (e.g., send email, retrieve data) and results in RDF, visualized via the standalone GUI or the Web portal.]
Fig. 10. The swget high-level architecture.
At the end of the execution, the results include: (i) the Web fragment (represented
in RDF) obtained according to the Visited/Successful semantics (see Section 4); (ii) the
outcome of the actions fired during the navigation. The Web fragment can be locally
stored and manipulated via third-party tools or can be accessed via the swget GUI
or the Web portal. The GUI uses the Prefuse API¹² for the visualization
and manipulation of graphs. The Web portal has been developed by using the Flash
technology (i.e., the Flex application framework) and the Flare visualization library¹³.
The command-line implementation of swget is intended for developers who
need low-level access to the swget functionality. This implementation is in the same
spirit as tools like wget¹⁴ and curl¹⁵. Our implementation features a Java API that can
be used to embed swget's features into third-party applications. As an example, a user
could leverage this API to embed in a standard HTML page structured information
taken from the WLOD about his/her favorite movies, books or scientific papers. In this
respect, the swget expression used to achieve this goal can be thought of as a view over
the WLOD that can be run on a regular basis to keep the page up-to-date.
The swget GUI facilitates the writing of swget scripts and provides a visual overview
of the results of the execution of such scripts. Fig. 11 shows the interface for the
creation of scripts. As can be observed, it is possible to specify the seed URI plus the
other parameters (e.g., the NAUTILOD expression). Fig. 12 shows the fragment
visualization tab of the swget GUI. In this particular case, it shows the co-author
network of Alberto O. Mendelzon (the node in red) when considering papers published
between 1980 and 1990. Besides the ending nodes (nodes in violet), the fragment also
contains other nodes (nodes in blue) that belong to paths from Mendelzon to his
co-authors satisfying the expression. Note that the GUI enables searching for
nodes/edges, zooming in/out and rearranging the visualization of the Web fragment.
12 http://prefuse.org/
13 http://flare.prefuse.org/
14 http://www.gnu.org/software/wget/
15 http://curl.haxx.se/
[Figure 11: screenshot of the script creation interface, with labeled areas: log area, swget options, NautiLOD expression, seed URI.]
Fig. 11. The swget GUI. On the left hand side, the log area prints messages about the current evaluation.
On the right hand side, it is possible to input the NAUTILOD expression and set network parameters.
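[Figure 12: screenshot of the fragment visualization tab, with URI selection and visualization controls; visible node labels include A.O.Mendelzon, M.Yannakakis, D.Maier, P.Atzeni and J.Ullman.]
Fig. 12. Exploring the results of the evaluation of a NAUTILOD expression.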
6. EXPERIMENTAL EVALUATION
We have introduced the NAUTILOD language and described three formal semantics:
one that captures sets of nodes (Section 3.2) and two that capture Web fragments
(Section 4). We have also described automaton-based algorithms to evaluate NAUTILOD
expressions according to the different semantics and discussed their complexity
(Sections 3.4.1 and 4.4). Besides, in Section 5 we introduced an implementation
of NAUTILOD in the swget tool, which is available in three different releases. In this
section, we discuss an experimental evaluation aimed at assessing the cost of
evaluating NAUTILOD expressions (via swget) on the WLOD. All the experiments have been
performed on a PC with an Intel 2.5GHz Core i5 processor and 8 GB of RAM.
Running times are the average of 5 runs; moreover, number of results and number of
dereferenced URIs are rounded to the next integer value. The objective of the
evaluation was to measure: (i) execution time; (ii) number of URIs dereferenced; (iii) number
of results retrieved; (iv) performance of swget as compared to related systems.
6.1. Experiment 1: semantics returning nodes
The first experiment focuses on the semantics of NAUTILOD returning nodes
introduced in Section 3.2 and the associated algorithms (i.e., Algorithm 1 and Algorithm 2).
6.1.1. The cost of evaluating NAUTILOD expressions. In this section we discuss the various
components that affect the cost of the evaluation of a NAUTILOD expression. We first
introduce some notation. We denote by e a NAUTILOD expression, by A the set of actions
and by T the set of tests specified in e that have to be performed when evaluating it, and
by ē the action-and-test-free expression obtained by removing from e all tests in T and
actions in A. Abstractly, we can separate the cost of evaluating e over a WLOD instance
W in three parts:
\[ cost(e, \mathcal{W}) = cost(\bar{e}, \mathcal{W}) + cost(A) + cost(T). \tag{1} \]
The first component considers the cost of evaluating the expression by taking out
actions and tests. Actions do not affect the navigation process, so we can treat
their cost separately. Moreover, tests are ASK-SPARQL queries having a different
structure from the pure navigational path expressions of the language; hence their
cost can also be treated independently. The cost of actions has essentially
two components: execution and transmission. The execution cost boils down to the cost
of evaluating the SELECT SPARQL query that gives the action's parameters. As for
transmission costs, a typical example is the sending of an email message including
some data. In this case the cost is that of sending such email. Note that what really
matters for our discussion is not the whole W, but only the set of nodes visited when
evaluating the expression; we will refer to such set as W_e. Thus, we have:
\[ cost(e, \mathcal{W}) = cost(\bar{e}, \mathcal{W}_e) + cost(A) + cost(T). \tag{2} \]
Note that the usage of tests possibly reduces the size of the set of nodes visited
during the evaluation. Thus, cost(ē, W_e) has to be reduced to take into account the
effective subset of nodes reachable after the filtering performed via tests. Let W_T be
the portion of W_e when taking into account this filtering. We have:
\[ cost(e, \mathcal{W}) = cost(\bar{e}, \mathcal{W}_T) + cost(A) + cost(T). \tag{3} \]
Finally, we want to point out that there is a relation between the diffusion (the portion of
linked datasources using a given predicate) and the selectivity (the number of RDF links
with a given predicate for a given URI) of the predicates used in a NAUTILOD expression
and its evaluation cost. In particular, using predicates with lower diffusion
and higher selectivity reduces the portion of the WLOD visited
during the evaluation and the network traffic generated. For example, we noticed that
some predicates (e.g., dbpedia:influenced) have a high diffusion and a low selectivity,
which makes the evaluation span a larger portion of the WLOD and dereference
many URIs. Other predicates (e.g., dbpedia:director) have a lower diffusion,
which reduces the portion of the WLOD reached during the evaluation of
an expression. Finally, predicates such as dbpedia:birthDate have a high diffusion
but very high selectivity (at most one occurrence for each URI visited). An
interesting study of these aspects has recently been performed [Schmachtenberg
et al. 2014].
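As a rough illustration of the two notions, the sketch below computes the diffusion of a predicate and the per-URI link counts on which selectivity is based. The toy triples, the datasource tags and the function names are all hypothetical; real figures would come from the datasources actually visited.

```python
from collections import defaultdict

# Hypothetical triples tagged with the datasource that serves them:
# (datasource, subject, predicate, object).
TRIPLES = [
    ("ds1", "a", "influenced", "b"),
    ("ds1", "a", "influenced", "c"),
    ("ds1", "a", "birthDate", "1928"),
    ("ds2", "b", "influenced", "d"),
    ("ds2", "b", "birthDate", "1950"),
    ("ds3", "m", "director", "a"),
]


def diffusion(triples, predicate):
    """Fraction of datasources that use `predicate` at least once."""
    all_sources = {ds for ds, _, _, _ in triples}
    using = {ds for ds, _, p, _ in triples if p == predicate}
    return len(using) / len(all_sources)


def links_per_uri(triples, predicate):
    """Number of links with `predicate` leaving each subject URI."""
    counts = defaultdict(int)
    for _, s, p, _ in triples:
        if p == predicate:
            counts[s] += 1
    return dict(counts)


print(diffusion(TRIPLES, "influenced"), links_per_uri(TRIPLES, "influenced"))
print(diffusion(TRIPLES, "director"), links_per_uri(TRIPLES, "birthDate"))
```

In this toy data, the predicate standing in for dbpedia:influenced is used by most datasources and fans out to several targets per URI, whereas the one standing in for dbpedia:birthDate occurs at most once per URI.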
6.1.2. Experiments. We evaluate the impact of the different components of a NAUTILOD
expression according to the equations introduced above. We chose two complex
expressions (shown in Fig. 13) and measured execution time (t), number of URIs
dereferenced (d) and number of triples retrieved (n). To investigate how the various
components affect the parameters t, d and n, each NAUTILOD expression has been
divided into 5 parts (i.e., σi, i ∈ {1..5}). These parts have been executed as a whole (i.e.,
σ[1-5]) and as action-and-test-free expressions (i.e., σ[noAT]), for which we use the notation
e1 and e2, respectively. Moreover, the various sub-expressions (i.e., σ[1-i], i ∈ {1..4})
have also been executed. This leads to a total of 12 expressions.
e1 (seed := dbpedia:Stanley_Kubrick)
  σ1: dbpo:influenced<1-3>
  σ2: [ASK {?ctx dbpo:birthDate ?y. FILTER (?y>1961-01-01)}]
  σ3: ACT[sdEmail("x@y.z","SELECT ?p WHERE{?ctx foaf:name ?p}")]
  σ4: dbpo:director
  σ5: owl:sameAs?

e2 (seed := dbpedia:Italy)
  σ1: dbpo:homeTown
  σ2: [ASK {?ctx rdf:type dbpo:Person. ?ctx rdf:type dbpo:MusicalArtist.}]
  σ3: dbpo:birthPlace
  σ4: [ASK {?ctx dbpo:populationTotal ?p. FILTER (?p<15000)}]
  σ5: owl:sameAs*
Fig. 13. Expressions used in the evaluation. Expressions have been executed incrementally and as a whole.
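The way the 12 expressions arise (four prefixes, the full expression and the action-and-test-free variant, for each of e1 and e2) can be sketched as follows. The tagging of each part as "nav", "test" or "action" and the function name are assumptions made for this illustration.

```python
# Parts of e1 from Fig. 13, tagged with a kind so that tests and actions can be
# dropped. The (kind, text) layout is a hypothetical representation.
E1_PARTS = [
    ("nav", "dbpo:influenced<1-3>"),
    ("test", "[ASK {?ctx dbpo:birthDate ?y. FILTER (?y>1961-01-01)}]"),
    ("action", 'ACT[sdEmail("x@y.z","SELECT ?p WHERE{?ctx foaf:name ?p}")]'),
    ("nav", "dbpo:director"),
    ("nav", "owl:sameAs?"),
]


def variants(parts):
    """Map each label to the list of parts executed for that variant."""
    out = {}
    # Prefixes sigma[1-1] .. sigma[1-5]; the last one is the whole expression.
    for i in range(1, len(parts) + 1):
        out[f"σ[1-{i}]"] = [text for _, text in parts[:i]]
    # Action-and-test-free variant: keep navigation parts only.
    out["σ[noAT]"] = [text for kind, text in parts if kind == "nav"]
    return out


for label, selected in variants(E1_PARTS).items():
    print(label, selected)
```

Each expression yields six variants, so the two expressions of Fig. 13 give twelve runs in total.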
The results, in logarithmic scale, are reported in Fig. 14 and Fig. 15. In particular,
the x-axis reports, from left to right, the 4 sub-expressions, the full expression (i.e.,
σ[1-5]) and the action-and-test-free expression (i.e., σ[noAT]). The first expression (e1)
starts by looking for people influenced by Stanley Kubrick up to distance 3 (subexpr.
σ[1-1]). This operation requires 61 secs., for a total of 221 URIs dereferenced. On the
description of each of these 221 URIs, an ASK query is performed to select only those
people who were born after 1961 (subexpr. σ[1-2]). The execution time of the queries
was about 4 secs. (i.e., ≈0.02 secs. per query), with 31 entities selected. Then, an
action is performed on the descriptions of these 31 entities by selecting the foaf:name
to be sent via email (subexpr. σ[1-3]). The select operation, the rendering of the results
in HTML format and the transmission of the emails had a cost of about 25 secs. The
navigation continues from the 31 entities found before the triggering of the action to
retrieve movies via the RDF property dbpo:director (subexpr. σ[1-4]). The cost was
about 34 secs., for a total of 136 movies. Finally, for each movie only one level of possible
additional descriptions is searched via the owl:sameAs property (the whole expr. σ[1-5]),
whose cost is 1638 secs., for a total of 409 new URIs available from multiple servers
(e.g., linkedmdb.org, freebase.org), of which only 289 were dereferenceable.
Thus, we have that cost(e1, W) = cost(e1, W_T1) + cost(A1) + cost(T1) = 1763 secs., with
cost(A1) ≈ 25 secs., cost(T1) ≈ 4 secs. and cost(e1, W_T1) ≈ 1738 secs. When considering
the test-and-action-free expression, cost(e1, W) ≈ 20018 secs. Note that the evaluation
of the ASK SPARQL queries had a cost of about 4 secs. and reduced the
portion of the WLOD navigated by e1; this saved about 20018-1738=18280
secs. Such a large difference in the execution time is explained by the fact that the 221
initial URIs selected by σ[1-1] were not filtered in the case of (e1, W) and thus caused
a larger number of paths to be followed at the second step of the evaluation. Indeed,
the total number of dereferenced URIs for (e1, W) was 6053, with about 660K triples
retrieved, while for (e1, W_T1) it was 646, with 125K triples retrieved.
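The arithmetic behind these figures can be checked with a short sketch. All numbers are those reported in the text for e1; the variable names are chosen here for illustration.

```python
# Figures reported in the text for e1 (times in seconds).
cost_nav_filtered = 1738     # cost(e1, W_T1), as reported
cost_unfiltered = 20018      # cost of the test-and-action-free expression
deref_filtered, deref_unfiltered = 646, 6053
ask_total, ask_queries = 4, 221

saved = cost_unfiltered - cost_nav_filtered        # time saved thanks to the tests
deref_ratio = deref_unfiltered / deref_filtered    # how many more URIs without tests
per_query = ask_total / ask_queries                # average cost of one ASK query

print(saved, round(deref_ratio, 1), round(per_query, 3))
```

Roughly 4 secs. spent on tests therefore avoid dereferencing more than nine times as many URIs.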
        σ[1-1]   σ[1-2]   σ[1-3]   σ[1-4]   σ[1-5]   σ[noAT]
e1          61       64       91      125     1763     20018
e2          84       87      101      104      161      1600
(a) Execution time for the expressions in secs.

        σ[1-1]   σ[1-2]   σ[1-3]   σ[1-4]   σ[1-5]   σ[noAT]
e1       44261    44261    44261    59013   125521    660515
e2       39477    39477    96798    96798    97649    273955
(b) Number of triples retrieved.
Fig. 14. Evaluation of swget on e1 and e2.
        σ[1-1]   σ[1-2]   σ[1-3]   σ[1-4]   σ[1-5]   σ[noAT]
e1         221      221      221      357      646      6053
e2         400      400      442      442      458      1277
(a) Number of dereferenced URIs.

        σ[1-1]   σ[1-2]   σ[1-3]   σ[1-4]   σ[1-5]   σ[noAT]
e1         222       31       31      136      545      6693
e2         399      156       42        5       29      1053
(b) Number of results (expression endpoints).
Fig. 15. Evaluation of swget on e1 and e2.
The second expression e2 specifies the navigation of the property dbpo:homeTown to
find entities living in Italy (subexpression σ[1-1]); this had an execution time of about 84
secs., with a total of 400 dereferenced URIs (one seed plus 399 additional URIs). On the
description of each URI, an ASK query filters entities that are of type dbpo:Person and
dbpo:MusicalArtist (subexpression σ[1-2]). All the queries were performed in about 3.8
secs., with an average time per query of 0.01 secs.; this selected 156 people.
At the second step, the navigation continues via the RDF property dbpo:birthPlace
to find the places where these people were born (subexpression σ[1-3]); this had a cost
of about 14 secs. In total, 42 new URIs were reached. The navigation continued with
a second ASK query to select only those places with fewer than 15000 inhabitants
(subexpr. σ[1-4]). The cost of performing the 42 ASK queries was about 3 secs., and
5 places were selected. Finally, for each of the 5 places additional descriptions were
searched by navigating the owl:sameAs property (the whole expr. σ[1-5]). This
reached a total of 29 URIs, some of which were external to dbpedia.org. The cost of
this operation was about 57 secs.
As for the total cost, we have cost(e2, W) = cost(e2, W_T2) + cost(A2) + cost(T2) = 161 secs. The
factor cost(A2) = 0 since e2 does not contain any action, whereas cost(T2) ≈ 6 secs., which
gives cost(e2, W_T2) = 155 secs. The cost of the action-and-test-free expression (i.e., e2)
over W is cost(e2, W) ≈ 1600 secs., with 1277 URIs dereferenced. This is because the
expression is not selective: it performs a sort of “semantic” crawling based only
on RDF predicates. In fact, the number of triples retrieved (see Fig. 14 (b)) is almost
three times higher than in the case of the expression with tests. By including the tests,
the evaluation of e2 is 1445 secs. faster.
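The per-step costs quoted in the discussion can be approximated from the execution times of the prefixes σ[1-1] to σ[1-5] shown in Fig. 14 (a). The sketch below takes differences between consecutive prefixes; since each prefix was a separate run, these differences are only estimates, and the function name is an assumption.

```python
# Execution times (secs.) of the prefixes sigma[1-1] .. sigma[1-5], from Fig. 14 (a).
PREFIX_TIMES = {
    "e1": [61, 64, 91, 125, 1763],
    "e2": [84, 87, 101, 104, 161],
}


def step_costs(prefix_times):
    """Estimated cost added by each part: difference between consecutive prefixes."""
    previous = [0] + prefix_times[:-1]
    return [cur - prev for prev, cur in zip(previous, prefix_times)]


for name, times in PREFIX_TIMES.items():
    print(name, step_costs(times))
```

For both expressions the last step, the navigation of owl:sameAs, accounts for the largest increment after the initial navigation, in line with the costs of about 1638 and 57 secs. reported in the text.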
6.2. Experiment 2: swget vs SWGETM
We now compare swget against a new implementation that leverages multi-threading
(referred to as SWGETM). For this comparison we used the following expression.
(foaf:knows)<1-X> (e3)
The notation <1-X> is a shorthand for the concatenation of up to X foaf:knows edges.
In the experiment, we considered values of X ranging from 1 to 5 and used A. Polleres’
FOAF profile (see footnote 16) as the starting node. The results of the comparison are reported in Fig. 16.
In particular, Fig. 16 (a) compares swget with SWGETM when using 40 threads. As
can be observed, there is a major improvement, especially for long-running queries.
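To make the <1-X> shorthand concrete, the sketch below expands it into the explicit paths of length 1 to X. The use of "/" as the concatenation symbol and the function name are assumptions made for this illustration.

```python
def expand_bounded(predicate: str, max_len: int) -> list[str]:
    """Return the paths p, p/p, ..., with up to `max_len` repetitions of `predicate`."""
    if max_len < 1:
        raise ValueError("max_len must be at least 1")
    # "/" is assumed here to denote concatenation of path steps.
    return ["/".join([predicate] * k) for k in range(1, max_len + 1)]


for path in expand_bounded("foaf:knows", 3):
    print(path)
```

An expression with <1-X> matches a node if any one of these paths leads to it from the starting node.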
(a) Execution time(s) - e3 (swget vs. SWGETM with 40 threads, dist1 to dist5)
(b) Execution time(s) - e3 (SWGETM with 5, 10, 20, 30 and 40 threads, dist1 to dist5)
Fig. 16. Comparison between swget and SWGETM on e3.
As an example, at dist5 in Fig. 16 (a), SWGETM is 17 times faster than swget.
In Fig. 16 (b) it can be observed that, for SWGETM, a considerable speedup is obtained by
increasing the number of threads up to 30. To test the benefit of SWGETM, we have
also executed expressions e1 and e2 (see Section 6.1.2) with a pool of 30 threads; even
in this case we obtained a significant speedup (5x).
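The general idea of dereferencing URIs with a bounded pool of threads can be sketched as follows. This is not the SWGETM implementation: the lookup is simulated with a short sleep, and the function names and example URIs are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_dereference(uri: str) -> tuple[str, int]:
    """Stand-in for an HTTP lookup: waits briefly and returns a made-up triple count."""
    time.sleep(0.01)
    return uri, len(uri)


def dereference_all(uris, workers):
    """Dereference all URIs using at most `workers` concurrent threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(fake_dereference, uris))


uris = [f"http://example.org/resource/{i}" for i in range(20)]
results = dereference_all(uris, workers=5)
print(len(results))
```

Because dereferencing is dominated by network latency, more workers let more lookups overlap, while the pool size caps the number of simultaneous connections opened against remote servers.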
6.3. Experiment 3: comparison with related work
We compared SWGETM against SQUIN (see footnote 17) and SPARQL 1.1’s property paths (for
which we considered the DBpedia SPARQL endpoint, see footnote 18). Since SQUIN is based on
a multi-threaded implementation, we compared it with SWGETM only. Further details
about the experimental setting and the complete query set are available at
http://swget.wordpress.com/evaluation. We considered three different sets of expressions.
The first set (e4) includes the influence network of Tim Berners-Lee (TBL)
in DBpedia restricted to scientists only. The seed node (i.e., TBL’s URI in DBpedia) is
dbpedia:Tim_Berners-Lee while the NAUTILOD expression is:
(dbp:influenced)<1-X>[Test]    (e4)
where the test is defined as follows:
Test = ASK {?ctx rdf:type dbpo:Scientist.}
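The evaluation of an expression of this shape can be sketched on an in-memory toy graph: follow up to X edges from the seed, then keep only the reached nodes that pass the test. The graph, the set of scientists and the function name are hypothetical; an actual evaluation dereferences URIs and runs the ASK query on each description.

```python
from collections import deque

# Hypothetical toy graph: node -> nodes reached via dbp:influenced.
INFLUENCED = {
    "TBL": ["A", "B"],
    "A": ["C"],
    "B": [],
    "C": ["D"],
    "D": [],
}
SCIENTISTS = {"A", "C", "D"}  # nodes for which the ASK test would succeed


def evaluate(seed, max_dist, edges, passes_test):
    """Nodes reachable from `seed` in 1..max_dist hops that pass the test."""
    seen = {seed}
    frontier = deque([(seed, 0)])
    reached = set()
    while frontier:
        node, dist = frontier.popleft()
        if dist == max_dist:
            continue  # do not expand beyond the distance bound
        for nxt in edges.get(node, []):
            reached.add(nxt)
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    # The test filters the ending nodes only.
    return {n for n in reached if passes_test(n)}


for x in range(1, 4):
    print(x, sorted(evaluate("TBL", x, INFLUENCED, SCIENTISTS.__contains__)))
```

Breadth-first traversal visits each node at its shortest distance from the seed, so expanding a node once is enough to decide reachability within the bound.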
16 http://www.polleres.net/foaf.rdf#me
17 http://www.squin.org
18 http://dbpedia.org/snorql/
In the expression, the notation <1-X> is a shorthand for the concatenation of up to X
dbp:influenced edges, while the test is used to filter ending nodes identifying scientists. In
the experiment we considered values of X ranging from 1 to 6.
The execution times for e4 are shown in Fig. 17 (a). As for SWGETM, we used a thread
pool of size 5 to avoid many simultaneous connection requests that may overload
servers and generate errors that would result in the loss of results. Compared to
SQUIN, SWGETM reported a higher running time at dist6. However, by looking at Fig.
17 (b) it can be noted that SWGETM performs a much larger number of dereferencing
operations (433 vs. 117).
(a) Execution time(s) - e4 (swgetM vs. SQUIN, dist1 to dist6)