Analyzing Broken Links on the Web of Data: an Experiment with
DBpedia
Enayat Rajabi
PhD candidate
Information Engineering Research Unit, Computer Science Department, University of Alcalá de Henares, Ctra. Barcelona km.
33.6, 28871 Alcalá de Henares, Spain. Email: enayat.rajabi@uah.es
Salvador Sanchez-Alonso
Associate professor
Information Engineering Research Unit, Computer Science Department, University of Alcalá de Henares, Ctra. Barcelona km.
33.6, 28871 Alcalá de Henares, Spain. Email: salvador.sanchez@uah.es
Miguel-Angel Sicilia
Full professor
Information Engineering Research Unit, Computer Science Department, University of Alcalá de Henares, Ctra. Barcelona km.
33.6, 28871 Alcalá de Henares, Spain. Email: msicilia@uah.es
Linked Open Data enables interlinking and integrating
any kind of data on the Web. Links between data
sources play a key role, as they allow software applications
(e.g., browsers, search engines) to operate over
the aggregated data space as if it were a single local
database. In this new data space, where DBpedia –a
dataset of structured information extracted from Wikipedia–
appears to be the central hub, we analyzed the
outgoing links of this hub in an effort to discover
broken links. The paper reports on an experiment that
examines the causes of broken links, and proposes some
treatments for the problem.
Introduction
The Linked Data approach (Bizer, Heath, & Berners-Lee,
2009), an innovative way of integrating and interlinking
different kinds of information on the Web of Data, combines
structured data so that it can be consumed by machines. The
approach promotes sharing data via Uniform Resource
Identifiers (URIs), so that institutions and data publishers can
leverage it to link their data to useful external datasets
(Fernandez, d’Aquin, & Motta, 2011). The Linked Open Data
(LOD)i cloud is a set of databases from various domains that
have been translated into RDF and connected to other datasets by
setting RDF links between data sources. It nowadays
constitutes a huge collection of interlinked data aimed at
improving search and discovery of related data on the Web.
Specifically, Linked Data terms are served as information
resources and addressed via URIs so that they become
discoverable. The URI is a generic means to identify and
describe not just digital entities (e.g., electronic documents,
metadata), but also real world objects and abstract concepts
(e.g., people, places). URIs are expected to be
dereferenceable, meaning that software agents and Linked
Data client applications can look up a URI using the HTTP
protocol and retrieve a description of the entity, represented in
RDF. URI references of links between data items are expected
to be “cool”ii so that the target remains accessible from the
source. Hence, links between data items play an essential role
in the Linked Data approach (Popitsch & Haslhofer, 2010). In this
context, broken links (Eysenbach & Trudel, 2005) become
problematic, especially for data consumers, who will not be
able to access the desired resources. On the “traditional” Web,
broken links have negative effects on search
engine rankings; in the Linked Data cloud, they may
split the linked datasets, preventing machines from
following URIs to retrieve further relevant data.
In this paper, we evaluate the phenomenon of broken links
in the LOD cloud, focusing on the external links of DBpediaiii,
as a hub commonly used to browse and explore the Web of
Data. For this purpose, we used a link checking engine, and
studied the impact of broken links on the interlinked datasets
from the DBpedia perspective. DBpedia extracts structured
information from Wikipedia, interlinks it with other datasets
and publishes the results using Linked Data conventions and
SPARQL (Morsey et al., 2012).
The rest of this paper is structured as follows. Section 2
highlights the importance of link integrity in LOD and
explains how the “broken link” phenomenon can negatively
affect the consolidation of the Web of Data. In Section 3 we
focus on the importance of the DBpedia dataset as our testing
case and we report measures of its broken links. Conclusions
and outlook are finally provided in Section 4.
Problem statement and related work
Link integrity has always been a significant factor for
discovering and exploring data on the Web (Davis, 1999). It
aims to ensure that a link remains valid, regardless of whether it
points to a target inside a given dataset or in an external dataset.
Link integrity becomes especially important when data publishers
interconnect their data with external data sources on the Web of
Data. Links between datasets are usually established either
automatically, by using interlinking tools, or manually, by data
publishers. Some interlinking tools, e.g., User Contributed
Interlinking (Hausenblas, Halb, & Raimond, 2008), build semantic
links between datasets relying on user contributions such as
manually matching two items, while others, e.g., RDF-AI
(Scharffe, Liu, & Zhou, 2009), the Silk Link Discovery
Framework (Bizer, Volz, Kobilarov, & Gaedke, 2009), and
LIMES (Ngonga Ngomo & Auer, 2011), work automatically
according to the directions provided in the user configuration.
Link integrity between data sources becomes defective when a
link target is deleted or moved to a new location, making it
impossible for data browsing tools (e.g., search engines) to
follow the links and reach the target data. Furthermore, broken
links annoy human end-users and force them either to abandon
the data provider's content or to look up the intended target
manually using a search engine.
Several causes can produce broken links in a dataset,
including the following:
- The data source points to targets that no longer exist. In
this case, the request is answered, but the specific resource
cannot be found.
- The server that hosts the target data stops working or
redirects to a new location.
- Authorization issues on the target side block access to
the linked resource, preventing the desired data from being
reached.
- The target responds but does not return the data fast
enough, and the browser times out.
- Human errors, e.g., a misspelled link, which mostly
occur when interlinking is carried out manually.
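The causes above can be approximated from the HTTP response a link checker receives. The following sketch maps status codes to coarse failure categories similar to the ones used later in this paper; the exact mapping and category names are illustrative assumptions, not the tool used in this study:

```python
def classify_status(code):
    """Map an HTTP status code to a coarse broken-link category.

    The categories mirror the causes listed above; the mapping of
    individual codes is an illustrative assumption, not a standard.
    """
    if code < 400:
        return "live"                              # target answered normally
    if code in (400, 404, 410):
        return "non-existing record"               # target deleted or mistyped
    if code == 503:
        return "service temporarily unavailable"   # host up, service down
    if code in (502, 504):
        return "server is down or unreachable"     # upstream failure
    return "other error"

# A 404 means the request was answered but the resource is gone:
assert classify_status(404) == "non-existing record"
```

Note that a pure code-based classification cannot capture timeouts or DNS failures; those surface as network errors rather than status codes and have to be caught separately.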
The simplest and most desirable way to address the problem is
for data publishers to prevent broken links in the first place. Data
providers can identify the external links manually at the time of
publishing datasets. This approach is applicable where data
providers can maintain and monitor all internal and external links.
In a decentralized Linked Open Data system, this is
impossible, as many external links are controlled by other data
publishers.
Several studies have attempted to identify and remedy the
broken link problem. Vesse, Hall & Carr (2010) introduced an
algorithm for retrieving linked data about a URI when the URI
is not resolvable; Vesse also proposed a system that allows
users to monitor and preserve linked data they are interested
in, using the expansion algorithm. Popitsch & Haslhofer (2010)
presented DSNotify, a tool for detecting and fixing broken
links, which can keep links between LOD datasets consistent
with some required user input. Liu and Li (2011) proposed an
approach that relies on the metadata of data sources to track
data changes, capturing modifications in real time and
adjusting notification timing to the different requirements of
data consumers. However, none of these studies estimated the
impact and potential causes of the problem in the LOD context.
In terms of Web analysis, some studies have also been
conducted on constructing and analyzing the World Wide Web
graph. Deo & Gupta (2003) investigated several graph models
of the “traditional” Web and made use of graph-theoretic
algorithms that help explain and structure the growth of the
Web. Serrano et al. (Serrano, Maguitman, Boguñá, Fortunato,
& Vespignani, 2007) also analyzed the “traditional” Web by
collecting content provided by web crawlers and representing
it in four different clusters, depending on the strength of their
relationships. They illustrated the data in different graphs and
examined them according to size, node degree and degree
correlations.
Experimental setting
There has been a considerable growth of datasets, which
are different in subject and size. In order to provide a more
precise depiction of LOD datasets, we studied statistical
measures from a graph perspective. The measures are taken
from the toolkit of Social Network Analysis (SNA), as a way
to analyze the LOD cloud from the perspective of a wide
network of interactions between entities. However, these
measures are not exclusive of SNA but they were also used in
previous graph analysis research of the World Wide Web.
Linked datasets, in such a social network, were mapped to
nodes – i.e. entities– while links between datasets were
reconsidered as relationships between those entities.
There exists a wide range of SNA metrics that allow
researchers to integrate and analyze the mathematical and
substantive dimensions of a network structure formed by
ties between persons, organizations, or other types of nodes
(Wasserman & Faust, 1994) (Scott, 2000). Some of these
metrics are:
- Betweenness Centrality (BC), which measures how often
a node appears on the shortest path between two other
nodes. High-betweenness nodes are usually key players
in a network, or bottlenecks in a communication network;
BC is therefore used for detecting important nodes in graphs.
- Degree, the count of connections a node has with other
nodes, including self-connections. It is the most common
topological metric in networks.
- Edge weight, a number assigned to each edge that
represents how strong the relationship between two
nodes in a graph is.
We made use of SNA metrics to graphically depict the LOD
network, to gain insight into the arrangement of LOD datasets,
and to evaluate their properties from a mathematical
perspective. To this end, we collected all the information
about LOD datasets from CKANiv by using a software
component that extracted the links between datasets.
Gathering information about 337 datasets with almost 450
million links, we arranged all the collected information in a
data matrix. It should be noted that the collected data was
curated by the maintainers of the datasets, and can thus be
regarded as a reliable estimate.
We used a case-by-affiliation matrix (Wasserman & Faust, 1994),
a general form of data matrix for social networks, in which
the rows and columns refer to LOD datasets and the values are
the number of outgoing links of each dataset. The data were
imported into an SNA tool, in this case NodeXLv (Hansen,
Shneiderman, & Smith, 2010), and the following SNA metrics
were applied to recognize the central datasets in the LOD cloud:
- Betweenness Centrality (BC): if a dataset has a high BC
value, then many datasets are connected through it to
others, which implies that the dataset plays an important
role in the LOD cloud.
- Degree: the number of datasets that are connected to the
current node (dataset).
- Edge weight: the number of links between two
datasets.
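As a minimal illustration of the degree and edge-weight metrics above, the sketch below derives in-degree, out-degree and edge weights from a small case-by-affiliation matrix. The dataset pairs and counts are toy examples, and BC itself is assumed to be computed by the SNA tool rather than here:

```python
from collections import defaultdict

# Toy case-by-affiliation matrix: (source, target) -> number of links.
# The entries are illustrative, not the full matrix used in the study.
link_matrix = {
    ("DBpedia", "Geonames"): 86_547,
    ("DBpedia", "Nytimes"): 9_678,
    ("DrugBank", "DBpedia"): 4_845,
}

in_degree = defaultdict(int)   # how many datasets point at a node
out_degree = defaultdict(int)  # how many datasets a node points to
for (src, dst), weight in link_matrix.items():
    out_degree[src] += 1
    in_degree[dst] += 1

# Edge weight is simply the matrix value for a pair of datasets:
assert link_matrix[("DBpedia", "Geonames")] == 86_547
assert out_degree["DBpedia"] == 2 and in_degree["DBpedia"] == 1
```

In the directed LOD graph this distinction between in- and out-degree is exactly what Table 1 reports for each dataset.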
Table 1 lists the five datasets with the highest BC
along with their degree values. As the table shows, the LOD
graph was treated as a directed graph: in-degree values refer
to the number of datasets that point to the current dataset,
while out-degree stands for the number of datasets pointed to
by the current dataset. DBpedia, as illustrated in the table,
shows the highest BC, as it lies in the middle of many paths
between other datasets. Geonames provides global place
names, their locations and additional information such as
population; it is thus referred to by 55 datasets and has no
outgoing links. Nevertheless, it is the dataset with the
second-highest BC.
Table 1: Top five datasets with high betweenness centrality

Dataset            In-Degree   Out-Degree   Betweenness Centrality
DBpedia                  181           30                82,664.24
Geonames                  55            0                10,958.12
DrugBank                   8           12                 7,446.53
Bio2rdf-goa               11            8                 3,751.97
Ordnance-survey           16            0                 3,272.72
With all the information extracted, we represented the data
in NodeXL, an open-source template for Microsoft Excel that
facilitates the exploration of network graphs. Once we had all
datasets in NodeXL, we filtered out those with fewer than 2
incoming links (226 datasets out of 337) to depict the final
graph in a more understandable way. Figure 1 illustrates the
generated LOD graph using a “Harel-Koren Fast Multiscale”
layout. As the figure shows, DBpedia is in the centre of the
LOD cloud, acting as a hub in the network. This was our main
motivation to go deeper into the analysis of DBpedia rather
than other datasets. In fact, we selected DBpedia as our
case study only after a careful graph analysis and examination
of its external links, which finally persuaded us that it could
properly be considered the central dataset of the LOD cloud.
Figure 1 also shows why some datasets such as Geonames and
Drugbank have high BC among the LOD datasets: they either
lie in the middle of many paths or are pointed to by many
other datasets.
Figure 1: DBpedia as a hub in the LOD cloud graph
NodeXL also allows experts to cluster a graph into different
groups (Hansen et al., 2010), and we used this feature to
confirm that DBpedia was indeed the central dataset in
comparison with the grouped datasets.
As is well known, the DBpedia dataset contains
structured information extracted from Wikipedia with the aim
of making this information available on the Web of Data, as
well as linking it to other datasets. It includes structured data
about persons, places and organizations, featuring labels
and abstracts for 10.3 million unique things in 111 different
languages (Bizer, Lehmann, et al., 2009); the full DBpedia
dataset features almost 36 million data linksvi to external RDF
datasets. Links to some datasets, such as the Flickr wrapprvii
and Freebaseviii, have been created automatically using
software tools like Silk (Bizer, Volz, et al., 2009). In terms of
the number of links between DBpedia and other datasets in
the LOD cloud, the Flickr wrappr ranks first with more
than 31 million links to DBpedia, while Freebase is second
with 3.6 million links. Apart from these two datasets,
Figure 2 illustrates the percentage distribution of DBpedia
outgoing links, where “Others” (3%) refers to datasets with
fewer than 10 thousand links to DBpedia.
Figure 2: DBpedia outgoing links
As mentioned earlier, several problems can cause a broken
link, all of which must be carefully checked. To examine the
availability of the links, we programmed a link checker
component that retrieved the HTTP response headers of the
URLs. In particular, a plainly broken link returns an
HTTP 400 or 404 code, indicating that there is an error with
the target and it is therefore unreachable. We clustered all the
HTTP responses into several groups, such as “server is
unreachable”, “timeout”, and “non-existing record”. The
workflow in Figure 3 shows how the tool checks every
outgoing link of DBpedia. The results were afterwards
inserted into a database for later evaluation. In addition to
validating each URL, the software set the response timeout to
10 seconds, which means that the lack of any server activity
from the target for this duration was considered an error.
The problem with this approach to link checking is that
some target data may be unavailable at the time of the request
but become available again in the future, which requires
running the link checking component periodically. We
examined the availability of over 1.67 million DBpedia links
based on the schedule presented in Table 2. The schedule
started in January 2013 and ran over 4 months in order to
analyze the links precisely.
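The checking workflow can be sketched as follows: fetch only the response header of each URL with a 10-second timeout and record the outcome in a local database for later analysis. This is a hedged reconstruction in Python, not the authors' actual tool; the URL list, table schema and category labels are placeholders:

```python
import sqlite3
import socket
import urllib.error
import urllib.request

def check(url, timeout=10):
    """HEAD-request `url`; return (url, outcome) without raising."""
    try:
        req = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return url, str(resp.status)          # e.g. "200"
    except urllib.error.HTTPError as e:
        return url, str(e.code)                   # e.g. "404"
    except (socket.timeout, TimeoutError):
        return url, "timeout"                     # no activity within 10 s
    except Exception:
        return url, "server is down or unreachable"

# Store every outcome for later evaluation, as in the workflow of Figure 3.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (url TEXT, outcome TEXT)")
for url in ["http://example.org/resource/1"]:     # placeholder link list
    conn.execute("INSERT INTO results VALUES (?, ?)", check(url))
conn.commit()
```

Requesting only the header keeps the periodic re-checks cheap; a full GET of millions of RDF documents would be far more expensive without changing the availability verdict.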
Table 2: Link checking schedule
Month Link checking dates
January 8th, 18th, and 28th
February 1st, 11th, and 28th
March 4th, 14th, and 24th
April 3rd, 13th, and 23rd
We filtered out both the Flickr wrappr and Freebase
datasets, as they include millions of links to DBpedia, which
made the link checking process too time-consuming for our
computational capabilities. The response time of each target
was also checked, assuming the link was live if the target
responded within 10 seconds. Although some links may be
unreachable for security reasons, we did not discover any
authorization problems among the analyzed URLs.
Figure 3: Link checker workflow
Table 3 lists the external datasets of DBpedia
along with their number of links and the average number of
broken links detected during the scheduled process. The fourth
column in the table shows the average number of broken links
relative to the dataset size (expressed per 100,000 triples). We
listed more detailed information about the datasets in
Appendix 1. Table 3 shows how, for example, the
Italian_public_schools dataset comprises a total of 169,000
triples (i.e., 1.69 units of 100,000 triples), of which 1,940
links were broken. The number 1,148 in the table (1,940
divided by 1.69) reflects the status of the dataset from the
availability perspective. Diseasome, as another example,
appears to be very problematic, as it has only 91,000 triples,
yet all of its 2,301 links from DBpedia were unreachable.
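The normalization in the fourth column can be reproduced as a one-line computation; the sketch below uses the two examples discussed in the text (the helper name is ours, not the paper's):

```python
def broken_per_100k(broken_links, dataset_triples):
    """Average broken links relative to dataset size (per 100,000 triples)."""
    return round(broken_links / (dataset_triples / 100_000))

# Italian_public_schools: 1,940 broken links over 169,000 triples.
assert broken_per_100k(1_940, 169_000) == 1_148
# Diseasome: 2,301 broken links over only 91,000 triples.
assert broken_per_100k(2_301, 91_000) == 2_529
```

Dividing by dataset size makes small but badly broken datasets (like Diseasome) stand out against large datasets with a handful of broken links.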
Table 3: DBpedia related datasets

Dataset                   Links number   Average # of broken links   Average # related to dataset size
Revyu                                6   0                           0
GHO                                196   0                           0
BBCwildlife                        444   0                           0
Amsterdam Museum                   627   0                           0
Openei                             678   0                           0
Dbtune-musicbrainz                 838   0                           0
Eunis                            3,079   0                           0
linkedmdb                       13,758   0                           0
Bricklink                       10,090   1                           0
Uscensus                        12,592   1                           0
Bookmashup                       8,903   4                           0
WordNet                        437,796   6                           0
Eurostat                           490   15 (3%)                     0
geospecies                      15,974   41                          2
Factbook                           545   5                           13
Nytimes                          9,678   55                          16
Dailymed                           894   150 (17%)                   91
TCM                                904   151 (17%)                   128
Gadm                             1,937   163 (8%)                    Unspecified ix
DBLP                               196   196 (100%)                  1
Wikicompany                      8,348   199 (2%)                    Unspecified
Cordis                             314   314 (100%)                  4
Geonames                        86,547   336                         0
Umbel                          891,822   1,005                       210
Italian_public_schools           5,822   1,940 (33%)                 1,148
Gutendata                        2,511   2,100 (84%)                 21,000
Diseasome                        2,301   2,301 (100%)                2,529
DrugBank                         4,845   4,745 (98%)                 619
Musicbrainz                     22,980   22,980 (100%)               38
Opencyc                         27,107   27,107 (100%)               1,694
Linkedgeodata                  103,618   37,791 (36%)                175
Figure 4 shows the number of broken outgoing links in
DBpedia by dataset (only datasets with more than 100 links).
Figure 4: DBpedia broken outgoing links
An analysis was later carried out on the logs of the link
checker with the aim of discovering the causes behind the
broken links. Figure 5 shows that more than 55% of
the external links identified as broken failed because the
service exposing the dataset was down or the server was not
reachable. Nearly 32% of the broken links referred to targets
that did not return the data within 10 seconds, so the browser
timed out. Furthermore, more than 10% of the broken DBpedia
links pointed to records that did not exist in the target dataset.
Finally, for a small percentage of the broken links (around 2%),
the related services were temporarily unavailable.
Figure 5: percentage of causes of broken links
Table 4 also lists each dataset along with the type of
error we encountered during link checking.

Table 4: DBpedia related datasets and their causes of failure

Dataset                  Cause of failure
Geospecies               Non-existing record
Nytimes                  Non-existing record
Dailymed                 Service temporarily unavailable
TCM                      Timeout
Gadm                     Service is down or unreachable
DBLP                     Service temporarily unavailable
Wikicompany              Non-existing record
Cordis                   Timeout
Geonames                 Non-existing record
Umbel                    Non-existing record
Italian_public_schools   Service is down or unreachable
Gutendata                Service is down or unreachable
Diseasome                Service temporarily unavailable
Musicbrainz              Service is down or unreachable
Opencyc                  Timeout
DrugBank                 Timeout
Linkedgeodata            Non-existing record / Service is down or unreachable
Conclusions and outlook
By evaluating the results of our link checking system over
DBpedia, we classified the related datasets into different groups:
- Live and fully accessible datasets (through DBpedia
links), such as Openei and Eunis.
- Datasets that were only partially reachable, i.e., that
include some links that were not accessible through
DBpedia. In particular, some data no longer existed in the
external data sources (e.g., Wikicompany and
Geonames).
- Datasets that were fully broken. Our analysis shows
that the hosts of these datasets were not reachable, either
temporarily or permanently (e.g., Cordis and
Musicbrainz).
- Datasets that were not reachable during a certain
period of time. Most often, the link checker could
successfully access the related host(s) during a
later checking round (e.g., TCM, Dailymed).
- Datasets (e.g., Opencyc, Diseasome) that did not
provide Linked Data access, i.e., dereferenceable URIs, a
SPARQL endpoint, or an RDF dump.
- Datasets (e.g., Musicbrainz) that were no longer
supported, as they appear to be the result of
research projects run on a voluntary or project-bound
basis (e.g., by individuals or universities). Given
the way they are managed, it is uncertain whether
they will continue operating in the long term, or at least
keep providing free access services.
A manual evaluation of the links found that all the broken
links of the Umbel dataset pointed to a single URL: one
unreachable resource in the target dataset accounted for
around 1,000 broken links. In addition, and with regard to
Table 1, the DrugBank dataset had a high BC value among
the datasets, even though around 98% of the outgoing links
to it were broken.
In view of these results, we examined the most common
current approaches for dealing with broken links in LOD
datasets. Data publishers can address the problem automatically
by using a link checking component, which can be applied
to a data source to periodically detect and fix the broken
links. Data consumers can also report broken links manually
to the data providers for them to resolve; this solution is slow
and ineffective, though. Other solutions, such as the Handle
System (Sun, 2001) or PURL, provide services for unique and
persistent identifiers of digital objects that survive changes in
the web infrastructure, and enable data publishers to store
identifiers of arbitrary resources. Data providers can utilize
those services to resolve identifiers into the information
necessary to access a resource, and later to authenticate and
update the current state of the resource without changing its
identifier. The benefit is that the name of an item persists over
changes of, e.g., its location.
The research presented herein can be extended by examining
the logs further and checking the links manually.
Specifically, the datasets that redirect to another target could
be analyzed in terms of their availability and the cause of the
redirection.
There is also a wide variety of datasets published in the
LOD cloud. As with the outgoing links of DBpedia, link
checking could be extended to other datasets as well. In
particular, the overall status of the LOD datasets, in terms of
link availability, could be obtained by examining all the
outgoing links of each dataset. As a result, the LOD cloud
could be monitored for broken links, and a central
reporting system, for example as part of the LOD stats portal,
could help datasets fix their broken links.
Acknowledgements
The work presented in this paper has been part-funded by
the European Commission under the ICT Policy Support
Programme CIP-ICT-PSP.2011.2.4-e-learning with project No.
297229 “Open Discovery Space (ODS)”, CIP-ICT-
PSP.2010.6.2- Multilingual online services with project No.
27099 “Organic.Lingua”, and INFRA-2011-1.2.2-Data
infrastructures for e-Science with project No.
283770 “AGINFRA”.
References
Bizer, C., Heath, T., & Berners-Lee, T. (2009). Linked Data - The Story So
Far. International Journal on Semantic Web and Information
Systems, 5(3), 1–22.
Bizer, C., Lehmann, J., Kobilarov, G., Auer, S., Becker, C., Cyganiak, R., &
Hellmann, S. (2009). DBpedia - A crystallization point for the
Web of Data. Web Semant., 7(3), 154–165.
Bizer, C., Volz, J., Kobilarov, G., & Gaedke, M. (2009). Silk - A Link
Discovery Framework for the Web of Data. In the 18th
International World Wide Web Conference, Madrid, Spain.
Davis, H. C. (1999). Hypertext link integrity. ACM Comput. Surv., 31(4es).
Deo, N., & Gupta, P. (2003). Graph-Theoretic Analysis of the World Wide
Web: New Directions and Challenges. Matemática Contemporânea,
Sociedade Brasileira de Matemática, 25, 49–69.
Eysenbach, G., & Trudel, M. (2005). Going, Going, Still There: Using the
WebCite Service to Permanently Archive Cited Web Pages.
Journal of Medical Internet Research, 7(5), e60.
Fernandez, M., d’Aquin, M., & Motta, E. (2011). Linking data across
universities: an integrated video lectures dataset. In Proceedings of
the 10th international conference on The semantic web - Volume
Part II (pp. 49–64). Berlin, Heidelberg: Springer-Verlag.
Hansen, D., Shneiderman, B., & Smith, M. A. (2010). Analyzing Social
Media Networks with NodeXL: Insights from a Connected World
(1st ed.). Morgan Kaufmann.
Hausenblas, M., Halb, W., & Raimond, Y. (2008). Scripting User Contributed
Interlinking. In Proceedings of the 4th workshop on Scripting for
the Semantic Web (SFSW2008), co-located with ESWC2008.
Liu, F., & Li, X. (2011). Using Metadata to Maintain Link Integrity for
Linked Data. In Proceedings of the 2011 International Conference
on Internet of Things and 4th International Conference on Cyber,
Physical and Social Computing (pp. 432–437). Washington, DC,
USA: IEEE Computer Society.
Morsey, M., Lehmann, J., Auer, S., Stadler, C., & Hellmann, S. (2012).
DBpedia and the live extraction of structured data from Wikipedia.
Program: electronic library and information systems, 46(2), 157–181.
Ngonga Ngomo, A.-C., & Auer, S. (2011). LIMES — A Time-Efficient
Approach for Large-Scale Link Discovery on the Web of Data. In
Twenty-Second International Joint Conference on Artificial
Intelligence.
Popitsch, N. P., & Haslhofer, B. (2010). DSNotify: handling broken links in
the web of data. In Proceedings of the 19th international
conference on World wide web (pp. 761–770). New York, NY,
USA: ACM.
Scharffe, F., Liu, Y., & Zhou., C. (2009). RDF-AI: an architecture for RDF
datasets matching, fusion and interlink. In Proceedings of IJCAI
2009 IR-KR Workshop.
Scott, J. (2000). Social Network Analysis: A Handbook. SAGE publication.
Serrano, M. Á., Maguitman, A., Boguñá, M., Fortunato, S., & Vespignani, A.
(2007). Decoding the structure of the WWW: A comparative
analysis of Web crawls. ACM Trans. Web, 1(2).
Sun, S. (2001). Establishing persistent identity using the handle system. In
Proceedings of the Tenth International World Wide Web
Conference.
Vesse, R., Hall, W., & Carr, L. (2010). Preserving Linked Data on the
Semantic Web by the application of Link Integrity techniques
from Hypermedia. In Linked Data on the Web (LDOW2010),
Raleigh, NC.
Wasserman, S., & Faust, K. (1994). Social network analysis: Methods and
applications. New York: Cambridge University Press.
Appendix 1
Dataset name URL Subject
Revyu http://revyu.com/ review and rate things
GHO http://gho.aksw.org/ statistical data for health problems
BBC WildLife http://www.bbc.co.uk/wildlifefinder/ Nature
Amsterdam Museum http://semanticweb.cs.vu.nl/lod/am/ Culture
OpenEI http://en.openei.org/ energy information
Dbtune http://dbtune.org/musicbrainz/ Music
Eunis http://eunis.eea.europa.eu biodiversity
LinkedMDB http://linkedmdb.org/ Movie
Bricklink http://kasabi.com/dataset/bricklink Marketing
Uscensus http://www.rdfabout.com/demo/census/ Population statistics
Bookmashup http://www4.wiwiss.fu-berlin.de/bizer/bookmashup/ Book
Factbook http://www4.wiwiss.fu-berlin.de/factbook/ Countries
WordNet http://www.w3.org/TR/wordnet-rdf lexical database of English
Eurostat http://eurostat.linked-statistics.org/ European Statistics
geospecies http://lod.geospecies.org/ GeoSpecies
Nytimes http://data.nytimes.com/ News
Dailymed http://www4.wiwiss.fu-berlin.de/dailymed/ Drugs
TCM http://code.google.com/p/junsbriefcase/wiki/TGDdataset medicines
Gadm http://gadm.geovocab.org/ GIS
DBLP http://www4.wiwiss.fu-berlin.de/dblp/ Book
Wikicompany http://wikicompany.org/ business
Cordis http://www4.wiwiss.fu-berlin.de/cordis/ EU programmes and projects
Geonames http://www.geonames.org/ontology/ geography
Umbel http://umbel.org/ technology and semantics
Italian_public_schools http://www.linkedopendata.it/datasets/scuole schools
Gutendata http://www4.wiwiss.fu-berlin.de/gutendata/ ebook
Diseasome http://www4.wiwiss.fu-berlin.de/diseasome/ disease
DrugBank http://wifo5-03.informatik.uni-mannheim.de/drugbank Drugs
Musicbrainz http://zitgist.com/ Music
Opencyc http://sw.opencyc.org/ diverse collection of real-world concepts in OpenCyc
Linkedgeodata http://linkedgeodata.org/ geography
Endnote
i http://lod-cloud.net/ (Retrieved 2013-06-22)
ii See http://www.w3.org/Provider/Style/URI
iii http://dbpedia.org/About (Retrieved 2013-06-22)
iv http://ckan.org/ (Retrieved 2013-06-22)
v http://nodexl.codeplex.com/ (Retrieved 2013-06-22)
vi http://wiki.dbpedia.org/About (Retrieved 2013-06-22)
vii http://wifo5-03.informatik.uni-mannheim.de/flickrwrappr/ (Retrieved 2013-06-22)
viii http://www.freebase.com (Retrieved 2013-06-22)
ix The size of the dataset was unavailable both in the LOD database and through the provider’s website