Towards OpenMath Content Dictionaries as
Linked Data
Christoph Lange
Computer Science, Jacobs University Bremen,
ch.lange@jacobs-university.de
Abstract. “The term ‘Linked Data’ refers to a set of best practices for
publishing and connecting structured data on the web” [7]. Linked Data
make the Semantic Web work practically, which means that informa-
tion can be retrieved without complicated lookup mechanisms, that a
lightweight semantics enables scalable reasoning, and that the decentralized
nature of the Web is respected. OpenMath Content Dictionaries (CDs)
have the same characteristics – in principle, but not yet in practice.
The Linking Open Data movement has made a considerable practical
impact: Governments, broadcasting stations, scientific publishers, and
many more actors are already contributing to the “Web of Data”. Queries
can be answered in a distributed way, and services aggregating data
from different sources are replacing hard-coded mashups. However, these
services are currently entirely lacking mathematical functionality. I will
discuss real-world scenarios where today’s RDF-based Linked Data do
not quite get their job done, but where an integration of OpenMath would
help – were it not for certain conceptual and practical restrictions.
I will point out conceptual shortcomings in the OpenMath 2 specification
and common bad practices in publishing CDs and then propose concrete
steps to overcome them and to contribute OpenMath CDs to the Web
of Data.
1 Linked Data State of the Art
The Linked Data principles, established by Berners-Lee in 2006 [4], consist of
four simple rules for publishing machine-understandable data on the web1:
1. Use URIs to identify things.
2. Use HTTP URIs so that these things can be referred to and looked up
(“dereferenced”) by people and user agents.2
3. Provide useful3 information about the thing when its URI is dereferenced,
using standard formats such as RDF/XML.
4. Include links to other, related URIs in the exposed data to improve discovery
of other related information on the Web.
1 here cited as paraphrased by Wikipedia [29]
2 I. e., the URI is treated as a URL.
3 This usually means: machine-understandable.
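Rules 2 and 3 can be illustrated with a minimal sketch of how a client might prepare to dereference a URI. The sketch assumes that the server supports HTTP content negotiation for RDF/XML, which is common practice but not guaranteed for any particular dataset. The function name `build_dereference_request` is hypothetical, and the request is only constructed, not sent.

```python
from urllib.request import Request


def build_dereference_request(uri: str, accept: str = "application/rdf+xml") -> Request:
    """Prepare (but do not send) an HTTP GET request asking for RDF/XML."""
    if not uri.startswith(("http://", "https://")):
        raise ValueError("expected an HTTP(S) URI")
    # A fragment identifier is not transmitted to the server, so the client
    # requests the enclosing document and resolves the fragment locally.
    document_uri = uri.split("#", 1)[0]
    return Request(document_uri, headers={"Accept": accept})


# Example: a symbol URI is looked up via the document that contains it.
request = build_dereference_request("http://www.openmath.org/cd/arith1#divide")
print(request.full_url, request.get_header("Accept"))
```

Passing the returned object to `urllib.request.urlopen` would perform the actual lookup; whether useful data comes back depends on the publisher following rule 3.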
arXiv:1006.4057v1 [cs.DL] 21 Jun 2010
[Figure 1: diagram of the Linked Open Data cloud; individual dataset labels omitted]
Fig. 1. Linked Open Datasets as of March 2009 [11]
These principles are widely considered to have made the Semantic Web vision
work practically. A lot of providers have already published their data according
to these principles and interlinked them with other datasets (cf. figure 1). The
hub in this big picture is DBpedia, a huge collection of general-purpose data
extracted from Wikipedia and made available as RDF. Data from specific do-
mains, such as scientific publications (green), biomedicine (pink), social networks
(orange), multimedia (dark blue), and government statistics (yellow) have also
been published as Linked Open Data. Linked Data do not have to be open4, but
making datasets open of course helps to interlink and reuse knowledge; there-
fore, the open datasets have so far been the most visible and most widely used
instance of Linked Data. Applications include browsers, which allow users to
traverse the Web of Data and discover connections, semantic search engines and
indexes, which enable a more accurate information retrieval than keyword-based
engines, as well as mashups that aggregate Linked Data from distributed sources
and expose them via a coherent user interface (see, e. g., [15] for an interactive
map of database researchers and their publications, filterable by research topics).
4 In fact they can also be useful in intranet settings, cf. [24]
Listing 1.1. Geese on the Isle of Wight, RDF data in Turtle notation, from
data.gov.uk (all URIs abbreviated, namespace prefix mappings omitted for
brevity; see [28] for full example)
ahs:EH100                              # just some ID for this data point
    scv:dimension env:isle-of-wight ;  # the "region" dimension
    scv:dimension env:year-2008 ;      # the "time" dimension
    scv:dimension env:geese ;          # the type of items counted
    rdf:value "693"^^xsd:decimal ;     # the count
    scv:dataset ahs2:livestock .       # back-reference to the dataset
2 The Need for Mathematical Semantics
None of the Linked Open datasets and applications known to date deals with
mathematical knowledge, not counting mere descriptions that do not involve
any mathematical semantics, e. g., of mathematical publications or mathematical
research topics, as they can be found in publication datasets or DBpedia. With
“Linked Open Numbers” [27], one mathematical dataset has been published,
but that was not to be taken seriously. There, every natural number from 1 to
999,999,999 is described with its predecessor, successor, natural logarithm, and
its name in various natural languages. This is pretty useless information5, and
indeed “Linked Open Numbers” was an April fool’s joke caricaturing the rampant
bad habit of mindlessly publishing datasets that are very large but serve no
reasonable purpose.
There is, however, no doubt that mathematical semantics is needed in order
to improve, or even enable, certain serious applications of Linked Data. I con-
sider statistical datasets, which are now being published as RDF Linked Data,
e. g., by the UK and US governments, a prime example. Omitola et al. have, for
example, used such data in order to answer queries for public sector information
in the user’s home region by aggregating data about, e. g., political representa-
tives of the local constituencies, crime statistics for the local county, and waiting
list statistics of local hospitals [21]. At the moment, these datasets contain a lot
of data points (e. g. the number of geese on the Isle of Wight in 2008; cf. list-
ing 1.1), without making their origin semantically explicit. We have proposed an
extension of the relevant Statistical Core Vocabulary (SCOVO), which allows us to
express the latter knowledge, saying, e. g., “the things that we are counting here
are geese (e. g. by referencing http://dbpedia.org/resource/Goose) per
area and per year” [28]. Mathematical knowledge becomes relevant when mod-
eling derived values, such as the geese population density of a region in a given
year, defined as the number of geese divided by area.6 At the moment, there
5 On the other hand, it might be useful to publish as Linked Data facts about numbers
that are hard to compute, e. g. factorizations of large numbers.
6 The geese population density is a fictitious example, but in the actual datasets, there
are derived values such as the [human] population density of various census regions,
or the average number of jobs per citizen.
Listing 1.2. Geese population density of the Isle of Wight, with its mathematical
semantics
# the density is computed by ...
ahs:PD100 sl:computedFrom [
    # ... calling OpenMath’s arith1#divide
    sl:function <http://www.openmath.org/cd/arith1#divide> ;
    sl:arguments
        # ... passing the value of the EH100 data point as first argument
        [ sl:argPosition "1"^^xsd:int ;
          sl:argValue ahs:EH100 ] ,
        # ... and the value of the AR100 data point as second argument
        [ sl:argPosition "2"^^xsd:int ;
          sl:argValue ahs:AR100 ] ] .
are a lot of derived values in the datasets published, simply given as additional
raw data points. For a client consuming these data, there is no way of verifying
their correctness or applying the same derivation rule to new or changed base
values, because the derivation rule is not made explicit. We have shown how
to make their mathematical semantics explicit – first on the instance level, as
that integrates most easily into existing datasets. Let the data point with the ID
ahs:AR100 be the area of the Isle of Wight, and let ahs:PD100 be the geese pop-
ulation density of the Isle of Wight in 2008, then we could express the fact that
the latter is ahs:EH100 divided by ahs:AR100 by referencing the OpenMath
symbol for division (cf. listing 1.2). In a second step, the same could be done on
vocabulary level: In addition to, or alternatively to, explicitly representing the
derivation of each data point, one could model a general rule that “for each data
point p containing a ‘population’ of some region r at some point t in time and
for each data point a containing the area of r [at time t], the population density
d of r at time t is defined as $d := \frac{p}{a}$”. Recall, however, that the semantics of
Linked Data vocabularies is usually intentionally weak in order to enable large-
scale applications. Such general rules would require more powerful clients and
query engines and might therefore not work as universally as semantically more
lightweight (albeit blatantly redundant) annotations of individual data points.
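The vocabulary-level rule quoted above (density as population divided by area, per region and point in time) can be sketched as follows. This is only an illustration of what a more powerful client would have to do, not a proposal for how such rules should be encoded. The `DataPoint` class, its field names, and the area value are assumptions made for this example; only the geese count is taken from listing 1.1. The sketch further assumes that every area data point carries a year.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataPoint:
    region: str
    year: int
    kind: str      # "population", "area" or "density" (hypothetical labels)
    value: float


def derive_densities(points):
    """Apply d := p / a to each population/area pair with the same region and year."""
    areas = {(p.region, p.year): p.value for p in points if p.kind == "area"}
    densities = []
    for p in points:
        key = (p.region, p.year)
        if p.kind == "population" and key in areas:
            densities.append(DataPoint(p.region, p.year, "density", p.value / areas[key]))
    return densities


points = [
    DataPoint("isle-of-wight", 2008, "population", 693.0),  # geese count from listing 1.1
    DataPoint("isle-of-wight", 2008, "area", 380.0),        # hypothetical area value
]
for density in derive_densities(points):
    print(density)
```

Because the rule is stated once, it also applies to data points that are added or changed later, which is exactly what per-instance annotations cannot offer without redundancy.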
For computing such a derivation, a Linked Data client has to translate these
RDF data to an OpenMath object, which has to be fed to a computation ser-
vice, e. g. a service that speaks SCSCP [12, 13]. We have detailed the trans-
lation in [28]. For standard symbols, such as arith1#divide here, the transla-
tion is pretty straightforward. Computing the division should not be a problem
for any OpenMath-aware service, as there is certainly a phrasebook mapping
arith1#divide to the native division operator of some computer algebra system.
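To make the translation step more concrete, here is a minimal sketch that builds an OpenMath object in the XML encoding from the information carried by an sl:computedFrom annotation. It is not the translation specified in [28]; the function `to_openmath`, the use of floating-point (OMF) arguments, and the area value are assumptions for illustration. The sketch also assumes that the data point values have already been looked up, so that numbers rather than data point URIs are passed in.

```python
import xml.etree.ElementTree as ET

OM_NS = "http://www.openmath.org/OpenMath"   # namespace of the OpenMath 2 XML encoding
ET.register_namespace("om", OM_NS)


def to_openmath(function_uri, arguments):
    """Build an OMOBJ applying the symbol at function_uri to numeric arguments.

    arguments is a list of (position, value) pairs in arbitrary order,
    mirroring sl:argPosition and sl:argValue.
    """
    cd_uri, name = function_uri.rsplit("#", 1)   # symbol name follows the '#'
    cdbase, cd = cd_uri.rsplit("/", 1)           # last path segment is the CD name
    omobj = ET.Element(f"{{{OM_NS}}}OMOBJ")
    oma = ET.SubElement(omobj, f"{{{OM_NS}}}OMA")
    ET.SubElement(oma, f"{{{OM_NS}}}OMS", {"cdbase": cdbase, "cd": cd, "name": name})
    # RDF triples are unordered, so the explicit positions restore the order.
    for _, value in sorted(arguments, key=lambda pair: pair[0]):
        ET.SubElement(oma, f"{{{OM_NS}}}OMF", {"dec": repr(float(value))})
    return omobj


obj = to_openmath(
    "http://www.openmath.org/cd/arith1#divide",
    [(2, 380.0), (1, 693.0)],   # 693 geese (listing 1.1); 380.0 is a hypothetical area
)
print(ET.tostring(obj, encoding="unicode"))
```

The resulting element could then be serialized and passed to a computation service; how that is done over SCSCP is outside the scope of this sketch.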
But now suppose that there are more complex, non-standard derivations in
our statistical dataset. This makes the case for publishing OpenMath CDs as
Linked Data, by the following considerations: Suppose the dataset contains the
Human Development Index (HDI) of a country7. Assuming that the four required
auxiliary data points have already been computed (LE = life expectancy index,
ALI = adult literacy index, GEI = gross enrollment index, and GDP = an index
computed from the gross domestic product per capita at purchasing power parity,
all normalized to a scale between 0 and 1), the HDI is defined as
$\frac{1}{3}\left(\mathrm{LE} + \frac{2}{3}\mathrm{ALI} + \frac{1}{3}\mathrm{GEI} + \mathrm{GDP}\right)$.
In [28], we propose that the dataset publishers define the HDI and
its derivation as a symbol in an OpenMath CD that accompanies the dataset, e. g.
http://example.org/statistics. Now suppose there is a derived data
point annotated as sl:computedFrom [ sl:function <http://example.org/
statistics#hdi> ; ... ] in analogy to listing 1.2. As OpenMath-based compu-
tation services and thus phrasebooks are developed independently from datasets
being published, we can hardly expect to find a phrasebook supporting
the http://example.org/statistics CD. Therefore, we propose to add
support for processing OpenMath CDs to Linked Data clients. For (re)computing
an HDI data point derived from four other data points containing the LE,
ALI, GEI, and GDP values, the client would download the definition of the
http://example.org/statistics#hdi symbol from the CD, expand the
mathematical expression using the definition, and then send that expanded ex-
pression, which only uses operators from the universally understood arith1 CD,
to the computation service.8
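The expansion step performed by the client can be sketched as follows. The sketch assumes that the client has already downloaded the CD and extracted the definition of the hdi symbol into an in-memory form; the nested-tuple representation, the `DEFINITIONS` table, and the helper names are hypothetical, and how a CD would actually encode such a definition is left open here. The argument values in the example are made up.

```python
ARITH1 = "http://www.openmath.org/cd/arith1#"
HDI = "http://example.org/statistics#hdi"


def app(symbol, *args):
    """Represent the application of a symbol to arguments as a nested tuple."""
    return (symbol, args)


# Hypothetical extracted definition:
# hdi(le, ali, gei, gdp) = 1/3 * (le + 2/3 * ali + 1/3 * gei + gdp)
DEFINITIONS = {
    HDI: (
        ("le", "ali", "gei", "gdp"),
        app(ARITH1 + "times",
            app(ARITH1 + "divide", 1, 3),
            app(ARITH1 + "plus",
                "le",
                app(ARITH1 + "times", app(ARITH1 + "divide", 2, 3), "ali"),
                app(ARITH1 + "times", app(ARITH1 + "divide", 1, 3), "gei"),
                "gdp")),
    ),
}


def substitute(expr, env):
    """Replace parameter names in a definition body by the actual arguments."""
    if isinstance(expr, str):
        return env.get(expr, expr)
    if isinstance(expr, tuple):
        symbol, args = expr
        return (symbol, tuple(substitute(a, env) for a in args))
    return expr


def expand(expr):
    """Recursively replace applications of defined symbols by their definitions."""
    if not isinstance(expr, tuple):
        return expr
    symbol, args = expr
    args = tuple(expand(a) for a in args)
    if symbol in DEFINITIONS:
        params, body = DEFINITIONS[symbol]
        return expand(substitute(body, dict(zip(params, args))))
    return (symbol, args)


def symbols_used(expr):
    """Collect all symbol URIs occurring in an expression."""
    if not isinstance(expr, tuple):
        return set()
    symbol, args = expr
    return {symbol}.union(*(symbols_used(a) for a in args))


expanded = expand(app(HDI, 0.9, 0.8, 0.7, 0.6))   # made-up index values
print(sorted(symbols_used(expanded)))
```

After expansion, only arith1 symbols remain, so the expression could be handed to any computation service whose phrasebook covers that CD.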
So far, I have outlined one use case in which OpenMath CDs as Linked Data
would be needed. In the following section, I will point out what actions this
requires on the OpenMath side. Note that, while the Linked Data principles have
been devised in the context of RDF, and while all contemporary Linked Open
datasets are available as RDF, the Linked Data guidelines do not prescribe RDF.
In fact, RDF might not be the most appropriate representation for mathematical
objects. It is at least quite cumbersome to break the ordered tree structure of
mathematical expressions down to unordered RDF triples (cf. [19] for one never-
adopted suggestion on how that could be done, and [28] for a critical review).
For the remainder of this paper, I assume that CDs will be published in their
reference XML encoding.
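The overhead of flattening an ordered expression into unordered triples can be seen in a small sketch. It follows the pattern of listing 1.2 (sl:function, sl:arguments, sl:argPosition, sl:argValue), not the suggestion from [19]; the function names and the blank node labels are assumptions for illustration.

```python
import itertools


def to_triples(symbol_uri, args, counter=None):
    """Flatten one function application into (subject, predicate, object) triples."""
    if counter is None:
        counter = itertools.count(1)
    node = f"_:b{next(counter)}"
    triples = [(node, "sl:function", symbol_uri)]
    for position, arg in enumerate(args, start=1):
        arg_node = f"_:b{next(counter)}"
        triples.append((node, "sl:arguments", arg_node))
        # The position must be stated explicitly, since triples carry no order.
        triples.append((arg_node, "sl:argPosition", position))
        triples.append((arg_node, "sl:argValue", arg))
    return node, triples


def ordered_arguments(node, triples):
    """Recover the argument order from an unordered collection of triples."""
    arg_nodes = [o for s, p, o in triples if s == node and p == "sl:arguments"]
    position = {s: o for s, p, o in triples if p == "sl:argPosition"}
    value = {s: o for s, p, o in triples if p == "sl:argValue"}
    return [value[n] for n in sorted(arg_nodes, key=position.__getitem__)]


node, triples = to_triples(
    "http://www.openmath.org/cd/arith1#divide", ["ahs:EH100", "ahs:AR100"]
)
print(len(triples), ordered_arguments(node, triples))
```

A single two-argument application already takes seven triples in this pattern, whereas the XML encoding keeps the order implicitly in the document structure.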
3 Linked Data Principles in OpenMath
First, let us see how much the Linked Data principles cited in section 1 are
already respected in the practice of publishing OpenMath CDs:
1. Hardly any CD author uses CDBase, which indicates a lack of awareness
that things can be identified by URIs.
2. The URIs used for OpenMath CDs/symbols are always HTTP URLs, but
due to the inconsistent usage of CDBase (cf. principle 1), most published
7 http://en.wikipedia.org/wiki/Human_Development_Index
8 Here, we assume that those values, from which the HDI is computed, are either
hard-coded in the dataset, or that they have been computed before, using the same
method.