Analyzing Broken Links on the Web of Data: An Experiment with DBpedia
Enayat Rajabi
PhD candidate
Information Engineering Research Unit, Computer Science Department, University of Alcalá de Henares, Ctra. Barcelona km.
33.6, 28871 Alcalá de Henares, Spain. Email: enayat.rajabi@uah.es
Salvador Sanchez-Alonso
Associate professor
Information Engineering Research Unit, Computer Science Department, University of Alcalá de Henares, Ctra. Barcelona km.
33.6, 28871 Alcalá de Henares, Spain. Email: salvador.sanchez@uah.es
Miguel-Angel Sicilia
Full professor
Information Engineering Research Unit, Computer Science Department, University of Alcalá de Henares, Ctra. Barcelona km.
33.6, 28871 Alcalá de Henares, Spain. Email: msicilia@uah.es
Linked Open Data enables interlinking and integrating any kind of data on the Web. Links between various data sources play a key role, as they allow software applications (e.g., browsers, search engines) to operate over the aggregated data space as if it were a single local database. In this new data space, where DBpedia, a dataset including structured information from Wikipedia, seems to be the central hub, we analyzed and highlighted outgoing links from this hub in an effort to discover broken links. The paper reports on an experiment to examine the causes of broken links, and proposes some treatments for solving this problem.
Introduction
The Linked Data approach (Bizer, Heath, & Berners-Lee, 2009), as an innovative way of integrating and interlinking different kinds of information on the Web of Data, conjoins structured data so that it can be consumed by machines. This approach extends data sharing via Uniform Resource Identifiers (URIs) so that institutions and data publishers can leverage it to link their data to useful external datasets (Fernandez, d'Aquin, & Motta, 2011). The Linked Open Data (LOD)i cloud is a set of databases from various domains that have been translated into RDF and linked to other datasets by setting RDF links between data sources. It nowadays constitutes a huge collection of interlinked data aimed at improving search and discovery of related data on the Web.
Specifically, Linked Data terms are served as information resources and addressed via URIs so that they become discoverable. The URI is a generic means to identify and describe not just digital entities (e.g., electronic documents, metadata), but also real-world objects and abstract concepts (e.g., people, places). URIs are expected to be dereferenceable, meaning that software agents and Linked Data client applications can look up a URI using the HTTP protocol and retrieve a description of the entity represented as RDF. URI references of links between data items are expected to be "cool"ii so that the target remains accessible from the source. Hence, links between data items play an essential role in the LD approach (Popitsch & Haslhofer, 2010). In this context, broken links (Eysenbach & Trudel, 2005) become problematic, especially for data consumers, which will not be able to access the desired resources. In the "traditional" Web, broken links in websites have negative effects on search engine rankings, but in the Linked Data cloud, they may lead to splitting the linked datasets, which prevents machines from following the URIs to retrieve further relevant data.
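As an informal illustration of this look-up mechanism, the sketch below (a minimal example assuming only the Python standard library; the DBpedia URI is merely illustrative) dereferences a resource URI over HTTP with content negotiation and retrieves an RDF description of the entity it identifies.

```python
# Minimal sketch of URI dereferencing with HTTP content negotiation.
# Assumes only the Python 3 standard library; the URI below is an illustrative example.
import urllib.request

uri = "http://dbpedia.org/resource/Berlin"  # example DBpedia resource URI
request = urllib.request.Request(uri, headers={"Accept": "application/rdf+xml"})

with urllib.request.urlopen(request, timeout=10) as response:
    # A "cool" URI answers (possibly via 303/302 redirects, which urllib follows)
    # with an RDF serialization describing the identified entity.
    print(response.status, response.headers.get("Content-Type"))
    rdf_description = response.read()
```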
In this paper, we evaluate the phenomenon of broken links
in the LOD cloud, focusing on the external links of DBpediaiii,
as a hub commonly used to browse and explore the Web of
Data. For this purpose, we used a link checking engine, and
studied the impact of broken links on the interlinked datasets
from the DBpedia perspective. DBpedia extracts structured
information from Wikipedia, interlinks it with other datasets
and publishes the results using Linked Data conventions and
SPARQL (Morsey et al., 2012).
The rest of this paper is structured as follows. Section 2 highlights the importance of link integrity in LOD and explains how the "broken link" phenomenon can negatively affect the consolidation of the Web of Data. In Section 3 we focus on the importance of the DBpedia dataset as our test case and report measures of its broken links. Conclusions and outlook are finally provided in Section 4.
Problem statement and related work
Link integrity has always been a significant factor for
discovering and exploring data in the Web (Davis, 1999). It
aims to ensure validating a link, regardless if it points to a
target inside a given dataset or to external datasets. Link
integrity becomes especially important when data publishers
interconnect their data to external data sources on the Web of
Data. In particular, integrating links between datasets is
usually established either automatically by using interlinking
tools or manually by data publishers. Different types of
interlinking tools e.g., User Contributed Interlinking
(Hausenblas, Halb, & Raimond, 2008) build semantic links
between datasets relying on user contributions such as
matching two items manually, while some others e.g., RDF-
IA (Scharffe, Liu, & Zhou., 2009), Silk Link Discovery
Framework (Bizer, Volz, Kobilarov, & Gaedke, 2009), and
LIMES (Ngonga Ngomo & Auer, 2011) work automatically
according to the directions provided by the user configuration.
Link integrity between data sources becomes defective when a
link target is deleted or moved to a new location, making it
impossible that data browsing tools (e.g., search engines)
follow the links to reach the target data. Furthermore, broken
links annoy human end-users and force them to either not use
the contents of data provider or to manually look up the
intended target using a search engine.
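The configuration syntax of tools such as RDF-AI, Silk or LIMES is not reproduced here; the following sketch is only a simplified, hypothetical illustration of what automatic interlinking amounts to: comparing resources from two datasets and emitting owl:sameAs links whenever a user-defined similarity condition holds. All URIs, labels and the threshold are made-up assumptions.

```python
# Simplified illustration of automatic interlinking (not the actual Silk/LIMES syntax):
# emit owl:sameAs links when the labels of two resources are sufficiently similar.
from difflib import SequenceMatcher

# Hypothetical (URI, label) pairs extracted from two datasets.
source_items = [("http://example.org/ds1/Aspirin", "Aspirin")]
target_items = [("http://example.org/ds2/acetylsalicylic-acid", "aspirin")]

THRESHOLD = 0.6  # illustrative similarity threshold set by the user configuration

for s_uri, s_label in source_items:
    for t_uri, t_label in target_items:
        similarity = SequenceMatcher(None, s_label.lower(), t_label.lower()).ratio()
        if similarity >= THRESHOLD:
            # An RDF link connecting the two data sources.
            print(f"<{s_uri}> <http://www.w3.org/2002/07/owl#sameAs> <{t_uri}> .")
```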
Several causes can produce broken links in a dataset, including the following:
- The data source points to targets that do not exist anymore. In this case, the request is answered, but the specific resource cannot be found.
- The source that hosts the target data stops working or redirects to a new location.
- Authorization issues on the target side block access to the linked resource, thus preventing the desired data from being reached.
- The target responds but does not return the data fast enough, and the browser times out.
- Human errors, e.g., misspelling the link, which mostly occur when interlinking is carried out manually.
Prevention by data publishers is the simplest and most preferable way to fix the problem. Data providers can identify the external links manually at the time of publishing their datasets. This approach is applicable where data providers can maintain and monitor all internal and external links. In a decentralized Linked Open Data system, this would be impossible, as many external links are controlled by other data publishers.
Several studies have attempted to identify and remedy the broken link problem. Vesse, Hall & Carr (2010) introduced an algorithm for retrieving linked data about a URI when the URI is not resolvable. Vesse also proposed a system which allows users to monitor and preserve linked data they are interested in, using the expansion algorithm. Popitsch & Haslhofer (2010) presented DSNotify, a tool for detecting and fixing broken links, which can keep links between LOD datasets consistent with some required user input. Liu and Li (2011) proposed an approach which relies on the metadata of data sources to track data changes, capturing the modifications of the data in real time and adjusting notification timing to the different requirements of the data consumers. However, none of these studies has been conducted in the LOD context with the aim of estimating the impact and potential causes of the problem.
In terms of Web analysis, some studies have also been conducted in the context of constructing and analyzing the World Wide Web graph. Deo & Gupta (2003) investigated several graph models of the "traditional" Web and made use of graph-theoretic algorithms that help explain and structure the growth of the Web. Serrano et al. (Serrano, Maguitman, Boguñá, Fortunato, & Vespignani, 2007) also analyzed the "traditional" Web by collecting content provided by web crawlers and representing it in four different clusters, depending on the strength of their relationships. They illustrated the data in different graphs and examined them according to size, node degree and degree correlations.
Experimental setting
There has been a considerable growth of datasets, which
are different in subject and size. In order to provide a more
precise depiction of LOD datasets, we studied statistical
measures from a graph perspective. The measures are taken
from the toolkit of Social Network Analysis (SNA), as a way
to analyze the LOD cloud from the perspective of a wide
network of interactions between entities. However, these
measures are not exclusive of SNA but they were also used in
previous graph analysis research of the World Wide Web.
Linked datasets, in such a social network, were mapped to
nodes i.e. entities– while links between datasets were
reconsidered as relationships between those entities.
There exists a wide range of SNA metrics which allow
researchers to integrate and analyze mathematical and
substantive dimensions of a network structure formed as a
result of ties formed between persons, organizations, or other
types of nodes (Wasserman, 1994) (Scott, 2000). Some of
these metrics are:
- Betweenness Centrality (BC), which measures how often a node appears on the shortest path between two other nodes. High-betweenness nodes are usually key players in a network or a bottleneck in a communication network. Thus, it is used for detecting important nodes in graphs.
- Degree, the count of connections a node has with other nodes, including self-connections. It is the most common topological metric in networks.
- Edge weight, a number assigned to each edge that represents how strong the relationship between two nodes in a graph is.
We made use of SNA metrics to graphically highlight the LOD network, to gain insight into the arrangement of LOD datasets, and to evaluate their properties from a mathematical perspective. Consequently, we collected all the information about LOD datasets from CKANiv by using a software component that exploited the links between datasets. Gathering information about 337 datasets with almost 450 million links, we aligned all the collected information in a data matrix. It should be noted that the collected data was curated by the maintainers of the datasets, and thus it can be regarded as a reliable estimation.
We used a case-by-affiliation matrix (Wasserman & Faust, 1994), a general form of data matrix for social networks, in which the rows and columns refer to LOD datasets and the values are the number of outgoing links of each dataset. The data were imported into an SNA tool, in this case NodeXLv (Hansen, Shneiderman, & Smith, 2010), and the following SNA metrics were applied to identify the central datasets in the LOD cloud, as these metrics appear to capture the relevant correlations of a graph:
- Betweenness Centrality (BC): if a dataset has a high BC value, then many datasets are connected through it to others, which implies that the dataset plays an important role in the LOD cloud.
- Degree: the number of datasets that are connected to the current node (dataset).
- Edge weight: the number of links between two datasets.
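As a rough illustration of this step, the sketch below builds a directed, weighted graph from a case-by-affiliation style matrix and computes the three metrics. It uses the networkx library as a stand-in for NodeXL; only a few of the link counts are taken from the figures reported later in the paper, the rest are invented for the example.

```python
# Sketch of the graph-analysis step, using networkx as a stand-in for NodeXL.
# Link counts are partly illustrative, not the actual CKAN figures.
import networkx as nx

# Case-by-affiliation style matrix: outgoing link counts between datasets.
outgoing_links = {
    "DBpedia":  {"Geonames": 86_547, "DrugBank": 4_845, "Nytimes": 9_678},
    "DrugBank": {"DBpedia": 5_000},    # invented count, for illustration only
    "Nytimes":  {"Geonames": 2_000, "DBpedia": 1_000},  # likewise
}

graph = nx.DiGraph()
for source, targets in outgoing_links.items():
    for target, count in targets.items():
        graph.add_edge(source, target, weight=count)  # edge weight = number of links

betweenness = nx.betweenness_centrality(graph)   # key-player datasets
in_degree = dict(graph.in_degree())
out_degree = dict(graph.out_degree())

# Nodes with fewer than two incoming links could additionally be filtered out before drawing.
for dataset in graph.nodes:
    print(dataset, in_degree[dataset], out_degree[dataset], round(betweenness[dataset], 3))
```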
Table 1 shows the top five datasets with the highest BC, along with their degree values. As the table shows, the LOD graph has been treated as a directed graph. Incoming degree values refer to the number of datasets that point to the current dataset, while the outgoing degree stands for the number of datasets pointed to by the current dataset. As the table illustrates, DBpedia shows the highest BC, as it lies in the middle of many paths between other datasets. Geonames provides global place names, their location and some additional information such as population. Hence, it includes global information that is referred to by 55 datasets and has no outgoing links; nevertheless, it is the dataset with the second highest BC.
Table 1: Top five datasets with high betweenness centrality

Dataset           In-Degree   Out-Degree   Betweenness Centrality
DBpedia           181         30           82,664.24
Geonames          55          0            10,958.12
DrugBank          8           12           7,446.53
Bio2rdf-goa       11          8            3,751.97
Ordnance-survey   16          0            3,272.72
With all the information extracted, we represented the data in NodeXL, an open-source template for Microsoft Excel that facilitates exploring network graphs. Once we had all datasets in NodeXL, we filtered out those with fewer than 2 incoming links (226 datasets out of 337) to depict the final graph in a more understandable way. Figure 1 illustrates the generated LOD graph using a "Harel-Koren Fast Multiscale" layout. As the figure shows, DBpedia is in the centre of the LOD cloud, thus acting as a hub in the network. This was our main motivation to go deeper into the analysis of DBpedia instead of analyzing other datasets. In fact, we selected DBpedia as our case study only after a careful graph analysis and examination of its external links, which finally persuaded us that it could properly be considered the central dataset of the LOD cloud. Figure 1 also shows why some datasets such as Geonames and Drugbank have high BC among the LOD datasets: either they lie in the middle of a path or they are pointed to by other datasets.
Figure 1: DBpedia as a hub in the LOD cloud graph
NodeXL also allows experts to cluster a graph into different groups (Hansen et al., 2010), and we used this feature to confirm that DBpedia was indeed the central dataset in comparison to the grouped datasets.
As is well known, the DBpedia dataset contains structured information extracted from Wikipedia with the aim of making this information available on the Web of Data, as well as linking it to other datasets. It includes structured data about persons, places and organizations, and features labels and abstracts for 10.3 million unique things in 111 different languages (Bizer, Lehmann, et al., 2009); the full DBpedia dataset features almost 36 million data linksvi to external RDF datasets. Links to some datasets such as the Flickr wrapprvii and Freebaseviii have been automatically created by using software tools like Silk (Bizer, Volz, et al., 2009). According to the number of links between DBpedia and other datasets in the LOD cloud, the Flickr wrappr is the dataset with the most links, more than 31 million, while Freebase is the second one with 3.6 million links. Apart from these two datasets, Figure 2 illustrates the percentage of DBpedia outgoing links, where "Others" in the figure (with 3%) refers to the datasets with fewer than 10 thousand links to DBpedia each.
Figure 2: DBpedia outgoing links
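The shares plotted in Figure 2 follow from a simple aggregation of the per-dataset link counts. The sketch below uses a handful of counts quoted from Table 3 (the selection is illustrative, not the full data) and folds datasets contributing fewer than 10,000 links into the "Others" category.

```python
# Sketch: share of DBpedia outgoing links per target dataset (cf. Figure 2).
# A few counts are quoted from Table 3; the selection is illustrative, not complete.
# Datasets contributing fewer than 10,000 links are folded into "Others".
link_counts = {
    "Umbel": 891_822,
    "WordNet": 437_796,
    "Linkedgeodata": 103_618,
    "Geonames": 86_547,
    "Nytimes": 9_678,     # below the 10,000-link threshold, goes into "Others"
    "Eurostat": 490,      # likewise
}

grouped = {"Others": 0}
for dataset, count in link_counts.items():
    if count < 10_000:
        grouped["Others"] += count
    else:
        grouped[dataset] = count

total = sum(grouped.values())
for dataset, count in sorted(grouped.items(), key=lambda kv: -kv[1]):
    print(f"{dataset}: {100 * count / total:.1f}%")
```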
As mentioned earlier, several problems can cause a broken link, all of which must be carefully checked. To examine the availability of the links, we programmed a link checker component that retrieved the HTTP response headers of the URLs. In particular, a primarily broken link will return an HTTP 400 or 404 code, indicating that there is an error with the target and thus it is unreachable. We clustered all the HTTP responses into several groups, such as "server is unreachable", "time out", and "non existing record". The workflow in Figure 3 shows how the tool checks every outgoing link of DBpedia. The results were afterwards inserted into a database to be evaluated later. In addition to validating a URL, the software set the response timeout to 10 seconds, which means that the lack of any server activity from the target for this duration was considered to be an error.
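A minimal sketch of such a link check is shown below, assuming only the Python standard library. The outcome labels mirror the clusters used in our analysis, but the actual component, its exact request logic and its database schema are not reproduced here.

```python
# Minimal sketch of the link-checking step: request a URL, inspect the HTTP outcome
# and cluster it into the categories used in the analysis. Assumes only the Python
# standard library; the real component also stored every result in a database.
import socket
import urllib.error
import urllib.request

TIMEOUT = 10  # seconds of server inactivity tolerated before declaring an error

def check_link(url: str) -> str:
    request = urllib.request.Request(url, method="HEAD")  # headers suffice to test availability
    try:
        with urllib.request.urlopen(request, timeout=TIMEOUT) as response:
            return "live" if response.status < 400 else "non existing record"
    except urllib.error.HTTPError as error:
        if error.code in (400, 404, 410):
            return "non existing record"              # target answered, resource missing
        if error.code == 503:
            return "service temporarily unavailable"
        return "server error"
    except urllib.error.URLError as error:
        if isinstance(error.reason, socket.timeout):
            return "time out"                          # no data within 10 seconds
        return "server is down or unreachable"         # DNS failure, refused connection, ...
    except socket.timeout:
        return "time out"

print(check_link("http://dbpedia.org/resource/Berlin"))  # example URL
```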
The problem with this approach to link checking is that some target data may not be fetchable at the time of the request but might be available again in the future, which requires running the link checking component periodically. We examined the availability of over 1.67 million links of DBpedia based on the schedule presented in Table 2. This schedule was started in January 2013 and followed over 4 months in order to analyze the links precisely.
Table 2: Link checking schedule
Month Link checking dates
January 8th, 18th, and 28th
February 1st, 11th, and 28th
March 4th, 14th, and 24th
April 3rd, 13th, and 23rd
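Driving the repeated runs requires nothing more than a loop over the dates of Table 2; the sketch below assumes a check_link() helper like the one above and an external scheduler (e.g., a daily cron job) that invokes it.

```python
# Sketch: repeat the check on the dates of Table 2 and tag each result with the run date.
# Assumes the check_link() helper from the previous sketch and an iterable of DBpedia URLs.
from datetime import date

CHECK_DATES = [
    date(2013, 1, 8),  date(2013, 1, 18), date(2013, 1, 28),
    date(2013, 2, 1),  date(2013, 2, 11), date(2013, 2, 28),
    date(2013, 3, 4),  date(2013, 3, 14), date(2013, 3, 24),
    date(2013, 4, 3),  date(2013, 4, 13), date(2013, 4, 23),
]

def run_scheduled_checks(urls, today=None):
    today = today or date.today()
    if today not in CHECK_DATES:       # e.g., invoked daily by an external scheduler
        return []
    return [(today, url, check_link(url)) for url in urls]
```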
We filtered out both the Flickr wrappr and Freebase datasets, as they include millions of links to DBpedia, which made the link checking process too time-consuming for our computational capabilities. The response time of the target was also checked, assuming that the link was live if the target responded within 10 seconds. Although some links may be unreachable for security reasons, we did not discover any authorization problem among the analyzed URLs.
Figure 3: Link checker workflow
Table 3 lists the external datasets of DBpedia along with the number of links and the average number of broken links detected during the scheduled process. The fourth column in the table shows the average number of broken links relative to the dataset size (expressed per 100,000 triples). We list more detailed information about the datasets in Appendix 1. Table 3 shows how, for example, the Italian_public_schools dataset comprises a total of 169,000 triples (i.e., 1.69 hundred-thousands of triples), and 1,940 of its 5,822 incoming links from DBpedia were broken. The value 1,148 in the table (1,940 divided by 1.69) reflects the status of the dataset from the availability perspective. Diseasome, as another example, appears to be particularly problematic, as it has only 91,000 triples, yet all 2,301 of its links from DBpedia were unreachable.
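The fourth column is therefore simply the number of broken links divided by the dataset size expressed in hundred-thousands of triples; the two worked examples above can be reproduced as follows.

```python
# Sketch: broken links normalized by target dataset size (per 100,000 triples), cf. Table 3.
def broken_per_100k_triples(broken_links: int, dataset_triples: int) -> float:
    return broken_links / (dataset_triples / 100_000)

print(round(broken_per_100k_triples(1_940, 169_000)))  # Italian_public_schools -> 1,148
print(round(broken_per_100k_triples(2_301, 91_000)))   # Diseasome -> 2,529
```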
Table 3: DBpedia related datasets

Dataset                  Number of links   Average # of broken links   Average # relative to dataset size (per 100,000 triples)
Revyu                    6                 0                           0
GHO                      196               0                           0
BBCwildlife              444               0                           0
Amsterdam Museum         627               0                           0
Openei                   678               0                           0
Dbtune-musicbrainz       838               0                           0
Eunis                    3,079             0                           0
Linkedmdb                13,758            0                           0
Bricklink                10,090            1                           0
Uscensus                 12,592            1                           0
Bookmashup               8,903             4                           0
WordNet                  437,796           6                           0
Eurostat                 490               15 (3%)                     0
Geospecies               15,974            41                          2
Factbook                 545               5                           13
Nytimes                  9,678             55                          16
Dailymed                 894               150 (17%)                   91
TCM                      904               151 (17%)                   128
Gadm                     1,937             163 (8%)                    Unspecifiedix
DBLP                     196               196 (100%)                  1
Wikicompany              8,348             199 (2%)                    Unspecified
Cordis                   314               314 (100%)                  4
Geonames                 86,547            336                         0
Umbel                    891,822           1,005                       210
Italian_public_schools   5,822             1,940 (33%)                 1,148
Gutendata                2,511             2,100 (84%)                 21,000
Diseasome                2,301             2,301 (100%)                2,529
DrugBank                 4,845             4,745 (98%)                 619
Musicbrainz              22,980            22,980 (100%)               38
Opencyc                  27,107            27,107 (100%)               1,694
Linkedgeodata            103,618           37,791 (36%)                175
Figure 4 shows the number of broken outgoing links in
DBpedia by dataset (only datasets with more than 100 links).
Figure 4: DBpedia broken outgoing links
An analysis was later carried out on the logs of the link checker with the aim of discovering the implications of the broken links problem. Figure 5 shows that more than 55% of the external links identified as broken were due either to the fact that the service exposing the dataset was down, or to the server being unreachable. Nearly 32% of the broken external links referred to targets that did not return the data within 10 seconds, so the browser timed out. Furthermore, more than 10% of the broken DBpedia links pointed to records that did not exist in the target dataset. Finally, the services related to a small percentage of the broken links (around 2%) were temporarily unavailable.
Figure 5: Percentage of causes of broken links
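The breakdown in Figure 5 can be derived by counting the clustered outcomes stored by the link checker; a minimal sketch, with purely illustrative records, is shown below.

```python
# Sketch: derive the Figure 5 breakdown by counting the clustered outcomes in the checker logs.
# `results` stands for the stored (date, url, outcome) records; the rows here are illustrative.
from collections import Counter

results = [
    ("2013-01-08", "http://example.org/a", "server is down or unreachable"),
    ("2013-01-08", "http://example.org/b", "time out"),
    ("2013-01-18", "http://example.org/c", "non existing record"),
    ("2013-01-18", "http://example.org/d", "service temporarily unavailable"),
]

outcomes = Counter(outcome for _, _, outcome in results)
total = sum(outcomes.values())
for outcome, count in outcomes.most_common():
    print(f"{outcome}: {100 * count / total:.1f}%")
```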
Table 4 also lists each dataset along with the type of error we encountered during link checking.
Table 4: Causes of failure for DBpedia related datasets
Dataset Cause of failure
Geospecies Non existing record
Nytimes Non existing record
Dailymed Service temporarily unavailable
TCM Timeout
Gadm Service is down or unreachable
DBLP Service temporarily unavailable
Wikicompany Non existing record
Cordis Timeout
Geonames Non existing record
Umbel Non existing record
Italian_public_schools Service is down or unreachable
Gutendata Service is down or unreachable
Diseasome Service temporarily unavailable
Musicbrainz Service is down or unreachable
Opencyc Timeout
DrugBank Timeout
Linkedgeodata Non existing record / Service is down or unreachable
Conclusions and outlook
By evaluating the results of our link checking system over DBpedia, we classified the related datasets into different groups:
- Live and fully accessible datasets (through DBpedia links), such as Openei and Eunis.
- Datasets that were only partially reachable and which include links that were not accessible through DBpedia. In particular, some data did not exist in the external data sources anymore (e.g., Wikicompany and Geonames).
- Datasets which were fully broken. Our analysis shows that the hosts of these datasets were not reachable either temporarily or permanently (e.g., Cordis and Musicbrainz).
- Datasets which were not reachable during a certain period of time. Most often, the link checker could successfully access the related host(s) during a later checking run (e.g., TCM, Dailymed).
- Some datasets (e.g., Opencyc, Diseasome) did not provide Linked Data access, i.e., dereferenceable URIs, a SPARQL endpoint, or an RDF dump.
- A number of datasets (e.g., Musicbrainz) did not provide support, as they seem to be the result of research projects working on a voluntary or project-bound basis (e.g., individuals and universities). Given the way they are managed, it is uncertain whether they will continue operating in the long term or at least keep providing free access services.
A manual evaluation of the links found that all the broken links of the Umbel dataset pointed to a single URL. In other words, around 1,000 links targeted one resource in the dataset that was unreachable. In addition, and with regard to what Table 1 illustrates, the DrugBank dataset had a high BC value among the analyzed datasets, even though around 98% of the outgoing links to this dataset were broken.
With respect to the results, we examined the most common current approaches for dealing with broken links in LOD datasets. Data publishers can address the problem automatically by using a link checking component: a link checker can be applied to a data source to periodically detect and fix broken links. Data consumers can also manually report broken links to the data providers so that they resolve the problem; this solution is slow and rather ineffective, though. Other solutions, such as the handle system (Sun, 2001) or PURL, provide services for unique and permanent identifiers of digital objects that survive changes in the web infrastructure, and enable data publishers to store identifiers of arbitrary resources. Data providers can use those services to resolve identifiers into the information necessary to access the resource, and to later authenticate and update its current state without changing its identifier. This has the benefit of allowing the name of an item to persist over changes of, e.g., its location.
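The resolution step such services provide is an ordinary HTTP redirect from the stable identifier to the current location, so a client needs no special machinery to follow it. The sketch below uses a hypothetical PURL-style identifier; the domain and path are placeholders, not a real service.

```python
# Sketch: resolving a persistent identifier (e.g., a PURL or a handle proxy URL).
# The identifier below is a hypothetical placeholder; a real service answers with an
# HTTP redirect to wherever the resource currently lives, so links survive relocations.
import urllib.request

persistent_id = "https://purl.example.org/dataset/item-42"  # placeholder, not a real PURL

with urllib.request.urlopen(persistent_id, timeout=10) as response:
    # urllib follows the 3xx redirect; geturl() returns the resource's current address.
    print("resolved to:", response.geturl())
```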
The research presented herein can be extended by going further in checking the logs and examining the links manually. Specifically, datasets that redirect to another target could be analyzed in terms of their availability and the cause of the redirection.
There is also a wide variety of datasets published in the LOD cloud. Similarly to the outgoing links of DBpedia, link checking could be extended to other datasets as well. In particular, an overall picture of the LOD datasets, in terms of link availability, could be obtained by examining all the outgoing links of each dataset. As a result, broken links in the LOD cloud could be traced, and a central reporting system, for example as part of the LOD stats portal, could help dataset maintainers fix their broken links.
Acknowledgements
The work presented in this paper has been part-funded by
the European Commission under the ICT Policy Support
Programme CIP-ICT-PSP.2011.2.4-e-learning with project No.
297229 “Open Discovery Space (ODS)”, CIP-ICT-
PSP.2010.6.2- Multilingual online services with project No.
27099 “Organic.Lingua”, and INFRA-2011-1.2.2-Data
infrastructures for e-Science with project No.
283770 “AGINFRA”.
References
Bizer, C., Heath, T., & Berners-Lee, T. (2009). Linked Data - The Story So
Far. International Journal on Semantic Web and Information
Systems, 5(3), 1–22.
Bizer, C., Lehmann, J., Kobilarov, G., Auer, S., Becker, C., Cyganiak, R., &
Hellmann, S. (2009). DBpedia - A crystallization point for the
Web of Data. Web Semant., 7(3), 154–165.
Bizer, C., Volz, J., Kobilarov, G., & Gaedke, M. (2009). Silk - A Link
Discovery Framework for the Web of Data. In the 18th
International World Wide Web Conference, Madrid, Spain.
Davis, H. C. (1999). Hypertext link integrity. ACM Comput. Surv., 31(4es).
Deo, N., & Gupta, P. (2003). Graph-Theoretic Analysis of the World Wide
Web: New Directions and Challenges. Matemática Contemporânea, Sociedade Brasileira de Matemática, 25, 49–69.
Eysenbach, G., & Trudel, M. (2005). Going, Going, Still There: Using the
WebCite Service to Permanently Archive Cited Web Pages.
Journal of Medical Internet Research, 7(5), e60.
Fernandez, M., d’Aquin, M., & Motta, E. (2011). Linking data across
universities: an integrated video lectures dataset. In Proceedings of
the 10th international conference on The semantic web - Volume
Part II (pp. 49–64). Berlin, Heidelberg: Springer-Verlag.
Hansen, D., Shneiderman, B., & Smith, M. A. (2010). Analyzing Social
Media Networks with NodeXL: Insights from a Connected World
(1st ed.). Morgan Kaufmann.
Hausenblas, M., Halb, W., & Raimond, Y. (2008). Scripting User Contributed
Interlinking. In Proceedings of the 4th workshop on Scripting for
the Semantic Web (SFSW2008), co-located with ESWC2008.
Liu, F., & Li, X. (2011). Using Metadata to Maintain Link Integrity for
Linked Data. In Proceedings of the 2011 International Conference
on Internet of Things and 4th International Conference on Cyber,
Physical and Social Computing (pp. 432–437). Washington, DC,
USA: IEEE Computer Society.
Morsey, M., Lehmann, J., Auer, S., Stadler, C., & Hellmann, S. (2012). DBpedia and the live extraction of structured data from Wikipedia. Program: electronic library and information systems, 46(2), 157–181.
Ngonga Ngomo, A.-C., & Auer, S. (2011). LIMES — A Time-Efficient
Approach for Large-Scale Link Discovery on the Web of Data. In
Twenty-Second International Joint Conference on Artificial
Intelligence.
Popitsch, N. P., & Haslhofer, B. (2010). DSNotify: handling broken links in
the web of data. In Proceedings of the 19th international
conference on World wide web (pp. 761–770). New York, NY,
USA: ACM.
Scharffe, F., Liu, Y., & Zhou., C. (2009). RDF-AI: an architecture for RDF
datasets matching, fusion and interlink. In Proceedings of IJCAI
2009 IR-KR Workshop.
Scott, J. (2000). Social Network Analysis: A Handbook. SAGE Publications.
Serrano, M. Á., Maguitman, A., Boguñá, M., Fortunato, S., & Vespignani, A.
(2007). Decoding the structure of the WWW: A comparative
analysis of Web crawls. ACM Trans. Web, 1(2).
Sun, S. (2001). Establishing persistent identity using the handle system. In
Proceedings of the Tenth International World Wide Web
Conference.
Wasserman, S., & Faust, K. (1994). Social network analysis: Methods and
applications. New York: Cambridge University Press.
Vesse, R., Hall, W., & Carr, L. (2010). Preserving Linked Data on the
Semantic Web by the application of Link Integrity techniques
from Hypermedia. In, Linked Data on the Web (LDOW2010),
Raleigh, NC.
Appendix 1
Dataset name URL Subject
Revyu http://revyu.com/ review and rate things
Gho http://gho.aksw.org/ statistical data for health problems
BBC WildLife http://www.bbc.co.uk/wildlifefinder/ Nature
Amsterdam Museum http://semanticweb.cs.vu.nl/lod/am/ Culture
OpenEI http://en.openei.org/ energy information
Dbtune http://dbtune.org/musicbrainz/ Music
Eunis http://eunis.eea.europa.eu biodiversity
LinkedMDB http://linkedmdb.org/ Movie
Bricklink http://kasabi.com/dataset/bricklink Marketing
Uscensus http://www.rdfabout.com/demo/census/ Population statistics
Bookmashup http://www4.wiwiss.fu-berlin.de/bizer/bookmashup/ Book
Factbook http://www4.wiwiss.fu-berlin.de/factbook/ Countries
WordNet http://www.w3.org/TR/wordnet-rdf lexical database of English
Eurostat http://eurostat.linked-statistics.org/ European Statistics
geospecies http://lod.geospecies.org/ GeoSpecies
Nytimes http://data.nytimes.com/ News
Dailymed http://www4.wiwiss.fu-berlin.de/dailymed/ Drugs
TCM http://code.google.com/p/junsbriefcase/wiki/TGDdataset medicines
Gadm http://gadm.geovocab.org/ GIS
DBLP http://www4.wiwiss.fu-berlin.de/dblp/ Book
Wikicompany http://wikicompany.org/ business
Cordis http://www4.wiwiss.fu-berlin.de/cordis/ EU programmes and projects
Geonames http://www.geonames.org/ontology/ geography
Umbel http://umbel.org/ technology and semantics
Italian_public_schools http://www.linkedopendata.it/datasets/scuole schools
Gutendata http://www4.wiwiss.fu-berlin.de/gutendata/ ebook
Diseasome http://www4.wiwiss.fu-berlin.de/diseasome/ disease
DrugBank http://wifo5-03.informatik.uni-mannheim.de/drugbank Drugs
Musicbrainz http://zitgist.com/ Music
Opencyc http://sw.opencyc.org/ diverse collection of real-world concepts in OpenCyc
Linkedgeodata http://linkedgeodata.org/ geography
Endnote
i http://lod-cloud.net/ (Retrieved 2013-06-22)
ii See http://www.w3.org/Provider/Style/URI
iii http://dbpedia.org/About (Retrieved 2013-06-22)
iv http://ckan.org/ (Retrieved 2013-06-22)
v http://nodexl.codeplex.com/ (Retrieved 2013-06-22)
vi http://wiki.dbpedia.org/About (Retrieved 2013-06-22)
vii http://wifo5-03.informatik.uni-mannheim.de/flickrwrappr/ (Retrieved 2013-06-22)
viii http://www.freebase.com (Retrieved 2013-06-22)
ix The size of the dataset was unavailable both in the LOD database and through the provider’s website