Scalable Integration and Processing of Linked Data
Andreas Harth
Institute AIFB, Karlsruhe Institute of Technology
76128 Karlsruhe, Germany
harth@kit.edu

Aidan Hogan
Digital Enterprise Research Institute (DERI), National University of Ireland, Galway
Galway, Ireland
aidan.hogan@deri.org

Spyros Kotoulas
Dept. of Computer Science, Vrije Universiteit Amsterdam
1081 HV, Amsterdam, The Netherlands
kot@few.vu.nl

Jacopo Urbani
Dept. of Computer Science, Vrije Universiteit Amsterdam
1081 HV, Amsterdam, The Netherlands
jacopo@cs.vu.nl
ABSTRACT
The goal of this tutorial is to introduce, motivate and detail techniques for integrating heterogeneous structured data from across the Web. Inspired by the growth in Linked Data publishing, our tutorial aims at educating Web researchers and practitioners about this new publishing paradigm. The tutorial will show how Linked Data enables uniform access, parsing and interpretation of data, and how this novel wealth of structured data can potentially be exploited for creating new applications or enhancing existing ones.

As such, the tutorial will focus on Linked Data publishing and related Semantic Web technologies, introducing scalable techniques for crawling, indexing and automatically integrating structured heterogeneous Web data through reasoning.
Categories and Subject Descriptors
H.3.5 [Information Storage and Retrieval]: On-line Information Services
General Terms
Management
1. OVERVIEW
Linked Data is now a central topic in the Semantic Web: with recent adoption by governmental agencies (e.g., data.gov, data.gov.uk) and corporate entities (e.g., BestBuy, BBC, New York Times, Facebook, Metaweb) complementing rich community-driven exports (e.g., DBpedia, Geonames, FOAF and SIOC), its importance is steadily growing. Linked Data provides a rich source of openly-available structured data which current Web applications and research can exploit. As more and more Linked Data is published, such growth challenges the community to develop technologies which can handle data at Web scale. Taking these cues, our tutorial will focus on (i) pragmatic and (ii) scalable techniques for consuming Linked Data which are (iii) applicable to heterogeneous datasets as now available on the Web. These techniques include crawling, querying, reasoning and composing pipelined workflows for processing Linked Data. Semantic Web technologies are often presented in a theoretical or research-focused scope; we will instead provide a tangible introduction and hands-on approach to the topic, thus targeting the wider Web community. As such, our tutorial is tailored to be of interest to a wide catchment of WWW 2011 attendees.

Supported by the European Commission through projects PlanetData (FP7-257641) and LarKC (FP7-215535).

Copyright is held by the author/owner(s).
WWW 2011, March 28–April 1, 2011, Hyderabad, India.
ACM 978-1-4503-0637-9/11/03.
2. CONTENT
2.1 Introduction to RDF and Linked Data
The first session gives an overview of RDF and Linked
Data publishing. We will discuss the RDF data model and
Linked Data principles for publishing RDF data on the Web.
In particular, this session will cover:
- RDF rationale and basics
- Linked Data principles and introduction
- Current adoption and trends in Linked Data
Example materials include slides discussing end-user benefits of Governmental Linked Data, presented at the ICT 2010 event organised by the European Commission^1, and a paper discussing Linked Data publishing (see [2]).
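As a flavour of the data model covered in this session, the following sketch builds a tiny RDF graph as a set of (subject, predicate, object) triples and serialises it in (simplified) N-Triples syntax. The resource names are invented for illustration; a real application would use an RDF library rather than hand-rolled serialisation.

```python
# The RDF data model: a graph is a set of (subject, predicate, object)
# triples; subjects and predicates are URIs, objects are URIs or literals.
# All resource names below are invented.
FOAF = "http://xmlns.com/foaf/0.1/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/"

graph = {
    (EX + "alice", RDF_TYPE, FOAF + "Person"),
    (EX + "alice", FOAF + "name", '"Alice"'),
    (EX + "alice", FOAF + "knows", EX + "bob"),
}

def to_ntriples(triples):
    """Serialise triples in (simplified) N-Triples syntax."""
    def term(t):
        # Literals are quoted; everything else is a URI in angle brackets.
        return t if t.startswith('"') else "<%s>" % t
    return "\n".join("%s %s %s ." % tuple(term(x) for x in t)
                     for t in sorted(triples))

print(to_ntriples(graph))
```

Following the Linked Data principles, dereferencing `http://example.org/alice` over HTTP would return exactly this kind of machine-readable description.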
2.2 Scalable Linked Data Crawling
This session gives an overview of the state of the art in efficient data-retrieval techniques, including novel challenges and techniques for crawling Linked Data from the Web. We will present the architecture of a crawler for small to medium-sized datasets in the range of several hundred million triples. In particular, this session will cover:
- Linked Data location, access and crawling
- Scalable crawling techniques and algorithms
- Description of the open-source LDSpider Linked Data crawler

^1 http://www.w3.org/2010/09/egov-session-ict2010/open-data-applications.pdf
Example materials include open-source code for crawling Linked Data^2, papers discussing scalable RDF crawling (see [3]), and a presentation on the MultiCrawler architecture for distributed crawling of structured data^3.
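To illustrate the crawling loop discussed in this session (frontier management, URI deduplication, fragment stripping), here is a minimal breadth-first sketch. It is not the LDSpider implementation: the fetch function is injected so the example runs without network access, and real crawlers add politeness delays, robots.txt handling and per-domain queues.

```python
from collections import deque
from urllib.parse import urldefrag

def crawl(seeds, fetch, max_uris=100):
    """Dereference URIs breadth-first. `fetch` maps a URI to the list of
    URIs mentioned in the RDF document retrieved from it."""
    frontier = deque(urldefrag(u)[0] for u in seeds)  # strip #fragments
    seen = set(frontier)
    fetched = []
    while frontier and len(fetched) < max_uris:
        uri = frontier.popleft()
        fetched.append(uri)
        for linked in fetch(uri):
            linked = urldefrag(linked)[0]
            if linked not in seen:  # dedup: dereference each URI only once
                seen.add(linked)
                frontier.append(linked)
    return fetched

# Usage with a stand-in "web" (no network access needed):
web = {"http://a/": ["http://b/", "http://c/#me"],
       "http://b/": ["http://a/"],
       "http://c/": []}
print(crawl(["http://a/"], lambda u: web.get(u, [])))
```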
2.3 Scalable RDF Indexing Techniques
This session presents scalable techniques for indexing and querying local repositories of Linked Data. We will discuss the standardised SPARQL query language, and thereafter survey the state of the art in RDF storage with respect to research, directions and applications. In particular, this session will cover:
- Overview and challenges
- Introduction to the SPARQL standard
- Scalable RDF indexing systems
Example materials include the paper [1] and a presentation^4 on the YARS2 distributed RDF storage system.
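The index-permutation idea used by systems such as YARS2 can be sketched as follows: the same triples are kept under several orderings (SPO, POS, OSP) so that any triple pattern with bound terms maps to a direct lookup. This toy in-memory version is illustrative only; real systems use on-disk sorted indexes, dictionary encoding and distribution over machines.

```python
from collections import defaultdict

class TripleStore:
    """Toy in-memory store keeping three index permutations (SPO, POS, OSP)."""
    def __init__(self):
        self.spo = defaultdict(lambda: defaultdict(set))
        self.pos = defaultdict(lambda: defaultdict(set))
        self.osp = defaultdict(lambda: defaultdict(set))

    def add(self, s, p, o):
        # Each triple is stored once per permutation.
        self.spo[s][p].add(o)
        self.pos[p][o].add(s)
        self.osp[o][s].add(p)

    def match(self, s=None, p=None, o=None):
        """Answer a triple pattern; None acts as a variable."""
        if s is not None and p is not None:
            return {(s, p, x) for x in self.spo[s][p] if o in (None, x)}
        if p is not None and o is not None:
            return {(x, p, o) for x in self.pos[p][o]}
        if o is not None:  # o bound, p unbound; filter on s if bound
            return {(x, y, o) for x in self.osp[o] for y in self.osp[o][x]
                    if s in (None, x)}
        if s is not None:  # only s bound
            return {(s, y, x) for y in self.spo[s] for x in self.spo[s][y]}
        if p is not None:  # only p bound
            return {(x, p, y) for y in self.pos[p] for x in self.pos[p][y]}
        return {(x, y, z) for x in self.spo for y in self.spo[x]
                for z in self.spo[x][y]}
```

A SPARQL engine evaluates a basic graph pattern by joining the results of several such `match` calls, so the choice of permutations determines which patterns are cheap.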
2.4 Reasoning: Motivation and Overview
This session gives an introduction to the RDFS and OWL standards and to rule-based reasoning, with heavy emphasis on motivating reasoning for the Linked Data use-case and for integrating heterogeneous data from a large number of diverse sources. We also introduce algorithms which incorporate information about the provenance of data during reasoning to ensure robustness in the face of noisy or impudent remote data. In particular, this session will cover:
- Introduction to RDFS/OWL
- Introduction to rule-based reasoning
- Motivating reasoning with respect to Linked Data
- Web-reasoning approaches (web-scale/web-tolerant)
Example materials include papers relating to Linked Data reasoning (see [6, 4]), a tutorial introduction to Linked Data and OWL^5, and a Talis Research Seminar on reasoning over Web Data^6.
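As a concrete flavour of the rule-based materialisation motivated here, the sketch below forward-chains two RDFS entailment rules to a fixpoint over an in-memory graph. CURIE strings stand in for full URIs; production reasoners add many more rules plus the provenance/authority checks discussed in this session.

```python
# CURIE strings stand in for full URIs.
TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"

def materialise(triples):
    """Forward-chain rules rdfs9 and rdfs11 to a fixpoint; returns the closure."""
    closed = set(triples)
    while True:
        new = set()
        sub = [(s, o) for s, p, o in closed if p == SUBCLASS]
        # rdfs11: (c subClassOf d), (d subClassOf e) => (c subClassOf e)
        for c, d in sub:
            for d2, e in sub:
                if d == d2:
                    new.add((c, SUBCLASS, e))
        # rdfs9: (x type c), (c subClassOf d) => (x type d)
        for x, p, c in closed:
            if p == TYPE:
                for c2, d in sub:
                    if c == c2:
                        new.add((x, TYPE, d))
        if new <= closed:  # fixpoint reached: nothing further to derive
            return closed
        closed |= new
```

Applied to heterogeneous Linked Data, such rules let an application infer, for example, that every `foaf:Person` is also an instance of a more general class declared in another vocabulary.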
2.5 Scalable Distributed Reasoning over MapReduce
This session presents scalable distributed reasoning using the MapReduce distribution framework, enabling high performance over a cluster of commodity hardware. The session details the MapReduce framework (employed by Google and Yahoo, among others) and the award-winning WebPIE system, which integrates optimised execution strategies for rules supporting a (pragmatic) fragment of OWL semantics. In particular, this session will cover:
- MapReduce architecture
- Core optimisations and approach for distributed reasoning
- Implementing RDFS
- Extension to pD* (OWL Horst)
- Hands-on: how to launch the reasoner on a cluster and on the Amazon cloud

^2 http://code.google.com/p/ldspider/
^3 http://videolectures.net/iswc06_harth_ciswd/
^4 http://videolectures.net/iswc07_harth_frs/
^5 http://www.abdn.ac.uk/~csc280/tutorial/eswc2010/
^6 http://vimeo.com/15255566
Example materials include papers on reasoning over the MapReduce framework (see [6, 5]), a Talis Research Seminar on scalable reasoning^7, and the official WebPIE page^8.
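The core MapReduce join behind such reasoners can be simulated on a single machine: the map phase emits triples keyed on the class being joined, the shuffle groups values by key, and the reduce phase derives new `rdf:type` triples (rule rdfs9). This is a one-iteration toy in the spirit of the approach, not WebPIE itself, which iterates to a fixpoint and applies further optimisations such as holding schema triples in memory.

```python
from collections import defaultdict
from itertools import chain

def map_phase(triple):
    """Emit each relevant triple under a join key: the class position."""
    s, p, o = triple
    if p == "rdfs:subClassOf":
        yield s, ("schema", o)    # key c, value its superclass d
    elif p == "rdf:type":
        yield o, ("instance", s)  # key c, value an instance x

def reduce_phase(key, values):
    """Join schema and instance triples sharing the same class key."""
    supers = [v for tag, v in values if tag == "schema"]
    members = [v for tag, v in values if tag == "instance"]
    for x in members:
        for d in supers:
            yield (x, "rdf:type", d)  # derived: x rdf:type d

def run_mapreduce(triples):
    groups = defaultdict(list)  # stands in for the distributed shuffle
    for t in triples:
        for k, v in map_phase(t):
            groups[k].append(v)
    return set(chain.from_iterable(
        reduce_phase(k, vs) for k, vs in groups.items()))
```

On a real cluster, Hadoop partitions the keys across machines, so the join for each class is computed in parallel.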
2.6 Hands-on: Implementing a LarKC Workflow
This session allows attendees to get hands-on with building scalable Linked Data applications. Some of the technologies presented in the previous sessions will be put together using a scalable workflow engine tailored for Linked Data: the Large Knowledge Collider (LarKC). In particular, this session will cover:
- Workflow overview and rationale
- LarKC platform overview
- Hands-on: Building a LarKC workflow for crawling and reasoning over Linked Data
Example materials include a tutorial on the LarKC architecture^9.
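The workflow idea itself can be sketched generically as function composition: plug-in stages are chained so the output of one feeds the next. Note this is not the LarKC API (the real platform is a Java framework); the identify/select/reason stand-ins below are invented placeholders for plug-ins like those built in the hands-on session.

```python
def make_workflow(*stages):
    """Compose processing stages into a single pipeline function."""
    def workflow(data):
        for stage in stages:
            data = stage(data)  # each plug-in transforms the data stream
        return data
    return workflow

# Invented stand-ins for identify/select/reason plug-ins:
identify = lambda uris: [u for u in uris if u.startswith("http://")]
select = lambda uris: sorted(set(uris))
reason = lambda uris: {u: "processed" for u in uris}

pipeline = make_workflow(identify, select, reason)
print(pipeline(["http://a", "ftp://x", "http://a"]))
```

Because each stage has a uniform interface, stages such as a crawler or a MapReduce reasoner can be swapped in without changing the rest of the pipeline, which is the design rationale behind plug-in workflow engines.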
3. REFERENCES
[1] A. Harth, J. Umbrich, A. Hogan, and S. Decker. YARS2: A federated repository for querying graph structured data from the web. In 6th International Semantic Web Conference, 2nd Asian Semantic Web Conference, pages 211–224, 2007.
[2] A. Hogan, A. Harth, A. Passant, S. Decker, and A. Polleres. Weaving the Pedantic Web. In 3rd International Workshop on Linked Data on the Web (LDOW2010), Apr. 2010.
[3] A. Hogan, A. Harth, J. Umbrich, S. Kinsella, A. Polleres, and S. Decker. Searching and Browsing Linked Data with SWSE: the Semantic Web Search Engine. Technical report, Digital Enterprise Research Institute, Galway, 2010. http://www.deri.ie/fileadmin/documents/DERI-TR-2010-07-23.pdf.
[4] A. Hogan, J. Z. Pan, A. Polleres, and S. Decker. SAOR: Template Rule Optimisations for Distributed Reasoning over 1 Billion Linked Data Triples. In International Semantic Web Conference, 2010.
[5] J. Urbani, S. Kotoulas, J. Maassen, F. van Harmelen, and H. E. Bal. OWL Reasoning with WebPIE: Calculating the Closure of 100 Billion Triples. In ESWC (1), pages 213–227, 2010.
[6] J. Urbani, S. Kotoulas, E. Oren, and F. van Harmelen. Scalable Distributed Reasoning Using MapReduce. In International Semantic Web Conference, pages 634–649, 2009.
^7 http://vimeo.com/15256587
^8 http://cs.vu.nl/webpie
^9 http://videolectures.net/eswc2010_kotoulas_iotl/