Towards LarKC: a Platform for Web-scale Reasoning
Dieter Fensel (University of Innsbruck) Frank van Harmelen (Vrije Universiteit Amsterdam)
Bo Andersson (Astrazeneca AB) Paul Brennan (International Agency for Research on Cancer)
Hamish Cunningham (University of Sheffield) Emanuele Della Valle (CEFRIEL)
Florian Fischer (University of Innsbruck) Zhisheng Huang (Vrije Universiteit Amsterdam)
Atanas Kiryakov (OntoText Lab, Sirma AI Ltd.) Tony Kyung-il Lee (Saltlux)
Lael Schooler (Max Planck Institute for Human Development, Berlin) Volker Tresp (Siemens)
Stefan Wesner (University of Stuttgart) Michael Witbrock (Cycorp Europe)
Ning Zhong (Beijing University of Technology)
Abstract
Current Semantic Web reasoning systems do not scale to the requirements of their hottest applications, such as analyzing data from millions of mobile devices, dealing with terabytes of scientific data, and content management in enterprises with thousands of knowledge workers. In this paper, we present our plan of building the Large Knowledge Collider, a platform for massive distributed incomplete reasoning that will remove these scalability barriers. This is achieved by (i) enriching the current logic-based Semantic Web reasoning methods, (ii) employing cognitively inspired approaches and techniques, and (iii) building a distributed reasoning platform and realizing it both on a high-performance computing cluster and via "computing at home". We also discuss how the technologies of LarKC move beyond the state of the art of Web-scale reasoning.
1. Introduction
Michael Lynch, CEO and founder of Autonomy (Europe's second largest software company, with enterprise search and knowledge management as its core markets), recently stated that "meaning-based computing is the way of the future as 80 per cent of information within enterprises is unstructured and that understanding this 'hidden' intelligence is at the heart of improving the way we interact with information". Some of the most advanced use cases for such semantic computing today require reasoning about 10 billion RDF triples in less than 100 ms. These numbers originate from the telecom sector, which aims to generate revenue streams through new context-sensitive and personalized mobile services, but this is just one example of a general demand.

The Web has made tremendous amounts of information available that could be processed based on the formal semantics attached to it. The Semantic Web has developed a number of languages that use logic for this purpose. However, current logic-based reasoning systems do not scale to the amount of information and the setting that is required for the Web. The inherent trade-off between the expressiveness of a logical representation language and the scalability of reasoning over the information has been clearly observed from a theoretical point of view [2], but has also been shown to have a very practical impact on possible use cases [9].
Thus a reasoning infrastructure must be designed and built that can scale and that can be flexibly adapted to the varying requirements of quality and scale of different use cases. If such an infrastructure is not built, "meaning-based computing" will never happen on the Web and will remain confined to well-controlled data sets inside company intranets.
In this paper, we present our plan of building the Large Knowledge Collider, a platform for Web-scale reasoning; this research is carried out in the EU 7th Framework Programme project LarKC (http://www.larkc.eu). The aim of LarKC is to develop a platform for massive distributed incomplete reasoning that will remove the scalability barriers of currently existing reasoning systems for the Semantic Web. This will be achieved by:

- Enriching the current logic-based Semantic Web reasoning with methods from information retrieval, machine learning, information theory, databases, and probabilistic reasoning.
- Employing cognitively inspired approaches and techniques such as spreading activation, focus of attention, reinforcement, habituation, relevance reasoning, and bounded rationality.
- Building a distributed reasoning platform and realizing it both on a high-performance computing cluster and via "computing at home".
The rest of the paper is organized as follows. Section 2 presents the vision, mission, objectives and strategies of LarKC as an overview of the Large Knowledge Collider. Section 3 discusses the state of the art of Web-scale reasoning and its relation to LarKC. Section 4 presents several use cases for LarKC. Section 5 concludes the paper.
2. Overview of LarKC
2.1. Vision
The driving vision behind LarKC is to go beyond the limited storage, querying and inference technology currently available for semantic computing. The fundamental assumption is that such an infrastructure must go beyond the current paradigms, which are strictly based on logic. By fusing reasoning (in the sense of logic) with search (in the sense of information retrieval), and by taking seriously the notion of limited rationality (in the sense of Herbert Simon [11]), we will obtain the paradigm shift that is required for reasoning at Web scale.

The overall vision of LarKC is to build an integrated platform for semantic computing on a scale well beyond what is currently possible. The platform will fulfill needs in sectors that depend on massive heterogeneous information sources, such as telecommunication services, biomedical research, and drug discovery. The platform will have a pluggable architecture in which it is possible to exploit techniques and heuristics from diverse areas such as databases, machine learning, cognitive science, the Semantic Web, and others. The platform will be implemented on a computing cluster and via "computing at home", and will be available to researchers and practitioners from outside the consortium to run their own experiments and applications, using suitable plug-in components.
2.2. Missions and Objectives
We will develop the Large Knowledge Collider as a pluggable algorithmic framework implemented on a distributed computational platform. It will scale to Web-scale reasoning by trading quality for computational cost and by embracing incompleteness and unsoundness.

Pluggable: Instead of being built only on logic, the Large Knowledge Collider will exploit a large variety of methods from other fields: cognitive science (human heuristics), economics (limited rationality and cost/benefit trade-offs), information retrieval (recall/precision trade-offs), and databases (very large datasets). A pluggable architecture will ensure that computational methods from these different fields can be coherently integrated.

Distributed: The Large Knowledge Collider will be implemented on parallel hardware using cluster computing techniques, and will be engineered to be ultimately scalable to very large distributed computational resources, using techniques like those known from SETI@home.
The major objectives of LarKC are:

- Design an integrated pluggable platform for large-scale semantic computing.
- Construct a reference implementation of such an integrated platform for large-scale semantic computing, including a fully functional set of baseline plug-ins.
- Achieve sufficient conceptual integration between approaches of heterogeneous fields (logical inference, databases, machine learning, cognitive science) to enable the seamless integration of components based on methods from these diverse fields.
- Demonstrate the effectiveness of the reference implementation through applications in services based on data aggregation from mobile-phone users, meta-analysis of scientific literature in cancer research, and data integration and analysis in early clinical development and the drug-discovery pipeline.
2.3. Strategies
The Large Knowledge Collider will consist of a number of pluggable components: retrieval, abstraction, selection, reasoning and deciding. These components are combined in a simple algorithmic schema, shown in Algorithm 1. Researchers can design and experiment with multiple realizations of each of these components.
Algorithm 1: Algorithmic Schema

loop
    Obtain a selection of data (RETRIEVAL)
    Transform it to an appropriate representation (ABSTRACTION)
    Draw a sample (SELECTION)
    Reason on the sample (REASONING)
    if more time is available and/or the result is not good enough (DECIDING) then
        Increase the sample size (RETRIEVAL)
    else
        exit
    end if
end loop
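To make the schema concrete, the following is a minimal Python sketch of how such an anytime retrieve-abstract-select-reason-decide loop could be driven. The `plugins` object and its method names (`retrieve`, `abstract`, `select`, `reason`, `good_enough`) are hypothetical placeholders for pluggable components, not the actual LarKC API.

```python
import time

def anytime_reasoning(query, plugins, budget_seconds=0.1, initial_sample=1000):
    """Sketch of the algorithmic schema: keep enlarging the sample until the
    answer is good enough or the time budget is spent (anytime behavior)."""
    deadline = time.monotonic() + budget_seconds
    sample_size = initial_sample
    best_answer = None

    while True:
        data = plugins.retrieve(query, limit=sample_size)   # RETRIEVAL
        model = plugins.abstract(data)                       # ABSTRACTION
        sample = plugins.select(model, sample_size)          # SELECTION
        best_answer = plugins.reason(query, sample)          # REASONING

        time_left = deadline - time.monotonic()              # DECIDING
        if time_left > 0 and not plugins.good_enough(best_answer):
            sample_size *= 2   # worth trying harder: enlarge the sample
        else:
            return best_answer
```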
In LarKC, massive, distributed and necessarily incomplete reasoning is performed over Web-scale knowledge sources. Massive inference is achieved by distributing problems across heterogeneous computing resources, coordinated by the LarKC platform. This overall architecture is depicted in Figure 1. Some of the distributed computational resources will run highly coupled, high-performance inference on local parallel hardware before communicating results back to the distributed computation. With Web-scale inference, complete information is an empty hope, and the distributed computation shown in the left part of Figure 1 includes some failed computations that do not thwart the entire problem-solving task. The right side of the figure illustrates the architecture that achieves this. LarKC allocates resources strategically and tactically, according to its basic algorithmic schema, to: 1) retrieve raw content and assertions that may contribute to a solution, 2) abstract that information into the forms needed by its heterogeneous reasoning methods, 3) select the most promising approaches to try first, 4) reason, using multiple deductive, inductive, abductive, and probabilistic means to move closer to a solution given the selected methods and data, and 5) decide whether sufficiently many, sufficiently accurate and precise solutions have been found and, if not, whether it is worth trying harder. This basic problem framework is supplied as a plug-in architecture, allowing intra-consortium and extra-consortium researchers and users to experiment with improvements to automated reasoning.
Figure 1: LarKC Inference Strategy and Architecture
This architecture will enable productive yet frictionless interaction between the components and the various disciplines related to them, which are shown in Table 1.
Plugin component   Based on results in...
RETRIEVAL          Information Retrieval, Cognition, ...
ABSTRACTION        Machine Learning, Ontology, ...
SELECTION          Statistics, Machine Learning, Cognition, ...
REASONING          Logic, Probabilistic inference, ...
DECIDING           Economics, Computing, Decision theory, ...

Table 1: Disciplines
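As an illustration of what such a plug-in architecture might look like, the sketch below defines one possible set of Python interfaces for the five component types, which could back the `plugins` object sketched after Algorithm 1. The class and method names are our own assumptions for illustration, not the interfaces actually defined by the LarKC platform.

```python
from abc import ABC, abstractmethod
from typing import Any, Iterable

class Retriever(ABC):
    @abstractmethod
    def retrieve(self, query: str, limit: int) -> Iterable[tuple]:
        """Return candidate RDF triples that may be relevant to the query."""

class Abstracter(ABC):
    @abstractmethod
    def abstract(self, triples: Iterable[tuple]) -> Any:
        """Transform raw triples into the representation a reasoner needs."""

class Selector(ABC):
    @abstractmethod
    def select(self, model: Any, sample_size: int) -> Any:
        """Draw a (possibly biased) sample from the abstracted model."""

class Reasoner(ABC):
    @abstractmethod
    def reason(self, query: str, sample: Any) -> Any:
        """Derive (partial) answers from the sample."""

class Decider(ABC):
    @abstractmethod
    def good_enough(self, answer: Any) -> bool:
        """Decide whether to stop or to invest more resources."""
```

A concrete experiment would then consist of choosing one implementation per interface and plugging them into the anytime loop, which is how the configurable scale/quality trade-offs discussed above could be realized.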
3. State-of-the-art in Relevant Areas
3.1. Reasoning
Researchers have developed methods for reasoning in rather small, closed, trustworthy, consistent, and static domains. They usually provide a small set of axioms and facts. DL reasoners can deal with 10^5 axioms (concept definitions), but they scale poorly for large instance sets. Logic programming engines can deal with similar-sized rule sets as well as larger instance sets (say, 10^6), but they can draw only simple logical conclusions from these theories. Both streams are highly interesting areas of research, and topics such as how to combine them attract a lot of attention (e.g., [5], [3]).
Still, there is a deep mismatch between reasoning on a Web scale and efficient reasoning algorithms over restricted subsets of first-order logic. This mismatch is rooted in the underlying assumptions of current systems for computational logic:

- Small set of axioms. If the Web is to capture the entirety of human knowledge, the number of axioms ends up being very large.
- Small number of facts. Assuming a Google count of roughly 30 billion Web pages and a modest estimate of 100 facts per page, we end up with on the order of a trillion facts.
- Completeness of inference rules. The Web is open, with no defined boundaries. Therefore, completeness is an unrealistic requirement for an inference procedure in this context.
- Correctness of inference rules and consistency. Traditional logic takes axioms as reflecting truth and tries to infer the implicit knowledge they provide (their deductive closure). In a Web context, information is unreliable from the beginning, which means even a correct inference engine cannot maintain truth.
- Static domains. The Web is a dynamic entity: the known facts will change during the process of acquiring and using them.

Each of these assumptions needs to be revisited and adapted to the reality of the Web.
Merging reasoning and Web search: The only way to bridge the divide between reasoning and the Web is to interweave the reasoning process with the process of establishing the relevant facts and axioms through retrieval (ranking or selection) and abstraction (compressing or transforming information). In this way, retrieval and reasoning become two sides of the same coin: a process that aims for useful information derived from data on the Web.

From completeness and soundness to recall and precision: The project will fill a significant gap between the current methods for indexing and searching on the one hand, and reasoning on the other hand. To illustrate this gap, consider the typical success measures used by the search and reasoning communities. The conventional method for measuring success in a search task is in terms of the precision and recall of the results. It is well known that higher precision leads to lower recall, and vice versa.

Turning to reasoning, the majority of previous work has considered processes that are sound and complete, which, in an information-seeking context, equates to 100% precision and 100% recall. In practice this type of perfection has high costs both of setup (codifying domain expertise in an ontology and semantically structuring the entire information space) and of computation. The Large Knowledge Collider will exploit the space of possibilities above the precision/recall curve but below perfect completeness and soundness, using methods that trade off these properties of logic in exchange for lower costs of creation and computation. Pictorially, the project positions itself in the space between traditional search and traditional reasoning.
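To illustrate how soundness and completeness translate into precision and recall in this setting, the following minimal sketch scores the answers of an incomplete or unsound reasoner against a reference deductive closure. It assumes such a closure is available for comparison, which is only realistic for toy examples; the names and data are ours, purely for illustration.

```python
def precision_recall(answers: set, reference_closure: set) -> tuple:
    """Score a reasoner's answers against the full deductive closure:
    unsoundness lowers precision, incompleteness lowers recall."""
    if not answers:
        return 0.0, 0.0
    true_positives = len(answers & reference_closure)
    precision = true_positives / len(answers)
    recall = true_positives / len(reference_closure) if reference_closure else 0.0
    return precision, recall

# Toy example: a reasoner returns 3 of 5 derivable facts plus 1 incorrect one.
closure = {"a", "b", "c", "d", "e"}
found = {"a", "b", "c", "x"}
print(precision_recall(found, closure))  # (0.75, 0.6)
```

A sound and complete reasoner would score (1.0, 1.0); LarKC deliberately accepts lower scores in exchange for much lower setup and computation costs.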
A feature-based comparison with existing state-of-the-art reasoning platforms is shown in Table 2.

Feature                    State of the art   LarKC
Reasoning method           Logic only         Multiple methods
Heuristics                 Hardwired          Configurable plug-ins
Precision/recall           100/100            Configurable trade-off
Dynamic axioms and facts   Not supported      Supported
RDF retrieval scale        O(10^9)            O(10^12)
RDF inference scale        O(10^8)            O(10^10)
Anytime behavior           None               Configurable
Parallelism                None               Local cluster, O(10^2) machines;
                                              wide-area distribution, O(10^3) machines

Table 2: Feature-based comparison with existing state-of-the-art reasoning platforms.
3.2. Scalable Semantic Repositories
Ontotext produces high-performance RDF and OWL storage and inference technology and actively monitors the state of the art in this area. [10] provides extensive performance figures, which we summarize in Table 3; it compares the tools in terms of scalability, speed, and inference capabilities. The Cycorp inference engine [8], which is able to deal with knowledge bases containing several million assertions, will be used in a complementary fashion to the Ontotext technology with respect to reasoning tasks.

All the figures refer to the best published results for the corresponding tools. Direct comparisons of the tools should be made with caution. As with relational DBMSs, an engine's performance can vary considerably depending on its configuration and tuning with respect to the specific task or benchmark. Semantic repositories are even harder to compare because they also perform inference, which can be implemented in a wide variety of modalities; for example, forward-chaining harms the loading performance, while backward-chaining slows down the query evaluation. The storage and querying functionality of the Large Knowledge Collider will be based on the Ontotext technology behind SwiftOWLIM and BigOWLIM [7] and on related technology provided by CycEur. Consequently, we are indeed building on top of the best that the current state of the art has to offer.

Tool            Scale (mil. statements)   Inference        Load speed (1000 st./sec)   Hardware (GB RAM)
KAON2           ~10                       OWL DL + rules   20                          0.5
RacerPro        1                         OWL DL           -                           0.5
Minerva (IBM)   2                         OWL Lite +/-     >1                          0.5
Triple20        40                        OWL Lite +/-     6                           2
SwiftOWLIM      10-80                     OWL Lite +/-     20-60                       1-16
Sesame 2.0 NS   70                        RDFS +           6                           0.8
ORACLE10R2      100                       RDFS +           >1                          2
Jena v2.1/2.3   7-200                     -                ?-6                         2-?
KOWARI          235                       None             4                           ?
RDFGateway      262                       OWL Lite +/-     >1                          ?
AllegroGraph    1,000                     RDFS -           20                          2
OpenLink        1,060                     None             12                          8
BigOWLIM        1,060                     OWL Lite +/-     4                           12

Table 3: Scalable semantic repositories
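As a toy illustration of the forward-chaining trade-off mentioned above (inference work is paid at load time so that queries become cheap lookups), the sketch below materializes the transitive closure of rdfs:subClassOf over a handful of triples. It is a naive fixpoint computation for illustration only, not how OWLIM or any of the repositories listed in Table 3 is implemented.

```python
def materialize_subclass_closure(triples):
    """Forward-chaining at load time: add all implied (a, rdfs:subClassOf, c)
    triples so that later queries reduce to simple lookups."""
    SUBCLASS = "rdfs:subClassOf"
    inferred = set(triples)
    changed = True
    while changed:                      # iterate to a fixpoint
        changed = False
        edges = [(s, o) for (s, p, o) in inferred if p == SUBCLASS]
        for (a, b) in edges:
            for (c, d) in edges:
                if b == c and (a, SUBCLASS, d) not in inferred:
                    inferred.add((a, SUBCLASS, d))
                    changed = True
    return inferred

data = {("Dog", "rdfs:subClassOf", "Mammal"),
        ("Mammal", "rdfs:subClassOf", "Animal")}
closure = materialize_subclass_closure(data)
print(("Dog", "rdfs:subClassOf", "Animal") in closure)  # True: derived at load time
```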
In a very recent development, DERI Galway has implemented a distributed RDF store that reports real-time performance with up to 7 billion triples [6]. This system is limited to retrieving statements and does not perform any inference. Clearly, this impressive technology is an excellent basis for implementing basic indexing and lookup techniques, but it is by itself not enough. The essential difference here is similar to that between a database and a heuristic inference system. The Large Knowledge Collider is aiming at inference, not only at lookup, with search as its major paradigm, as opposed to retrieval. Clearly, no amount of improvement to efficient indexing systems will be able to beat the exponential search space generated by inference.

The Large Knowledge Collider will of course exploit such state-of-the-art indexing systems, but it deliberately aims for imperfect recall and precision as a means to achieve unlimited scalability (by exchanging quality for performance), while maintaining task-related bounded rationality.
3.3. Web Search
Information retrieval (IR) technology has proliferated in rough proportion to the expansion of knowledge and information as a central factor in economic success. The main dimensions conditioning the choice of search technology are:

Volume. The GYM big three search engines (Google, Yahoo!, Microsoft) deliver sub-second responses to hundreds of millions of queries daily over hundreds of terabytes of data. At the other end of the scale, desktop search systems can rely on substantial computing resources relative to a small data set.

Value. The retrieval of high-value content (typically within corporate intranets or behind pay-for-use turnstiles) is often mission-critical for the organization that owns the content. For example, the BBC allocates a skilled staff member for eight hours per broadcast hour to index their most important content.

To process Web-scale volumes, GYM use a combination of one of the oldest and simplest retrieval data structures (an inverted file that relates search terms to documents) and a ranking algorithm whose most important component is derived from the link structure of the Web. In general, when specifying a search, users enter a small number of terms as a query, based on words that people expect to occur in the types of document they seek. This gives rise to a fundamental problem, known as "index term synonymy": not all documents will use the same words to refer to the same concept. Therefore, not all the documents that discuss the concept will be retrieved by a simple keyword-based search. Furthermore, query terms may of course have multiple meanings; this problem is called "query term polysemy". The ambiguity of the query can lead to retrieval of irrelevant information and/or failure to retrieve relevant information. High-value (or low-volume) content retrieval systems address these problems with a variety of semantics-based approaches that attempt to perform conceptual indexing and logical querying. For example, the BBC system mentioned previously performs indexing using a thesaurus of 100,000 terms that generalizes over anticipated search terms. Life Sciences publication databases increasingly use rich terminological resources to support conceptual navigation (MeSH, the Gene Ontology, Snomed, the unified UMLS system, etc.). Our task in LarKC is to show how the high-value techniques can scale to higher volumes than is currently feasible.
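A minimal sketch of the inverted-file idea described above, together with the index term synonymy problem it runs into, could look as follows; the tiny document set and the synonym pair are invented for illustration.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def keyword_search(index, query):
    """Return documents containing every query term (simple conjunctive search)."""
    result = None
    for term in query.lower().split():
        postings = index.get(term, set())
        result = postings if result is None else result & postings
    return result or set()

docs = {1: "parking areas near the stadium",
        2: "car parks close to the arena"}   # same concept, different words
index = build_inverted_index(docs)
print(keyword_search(index, "parking"))  # {1} -- doc 2 is missed: index term synonymy
```

Conceptual indexing (e.g., mapping both "parking" and "parks" to a shared thesaurus concept before indexing) is precisely the kind of high-value technique that addresses this gap, at the cost of building and applying the terminological resource.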
3.4. Semantic Spaces
Data spaces are shared virtual data spaces designed for concurrent access by multiple processes. The data published and exchanged in the space is generally expressed as an atomic unit, called a tuple, consisting of typed fields. Processes interact with spaces through a minimal language or interface which expresses a coordination model with concurrency. The idea of applying space-based communication to the Semantic Web has led to the vision of Triple Spaces [4], middleware which allows shared access to semantic information in a structured and reliable way. Triple Spaces emphasizes the use of the space for the publication and retrieval of "payload" data that represent the know-how published on the Web. The structured representation of RDF triples in a space allows for semantic reasoning about this data, whilst the space guarantees the reliable and efficient placement of this knowledge, and additionally provides a smart interface including near-time notification about changes.
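A minimal sketch of such a space-based coordination model, reduced to publication and pattern-based retrieval of RDF triples with change notification, might look like this. The class and method names (TripleSpace, write, read, subscribe) are loosely modeled on tuple-space operations and are our own illustrative choices, not the TripCom or Triple Space API.

```python
class TripleSpace:
    """Toy shared space of RDF triples with tuple-space-style coordination."""
    def __init__(self):
        self.triples = set()
        self.subscribers = []

    def write(self, subject, predicate, obj):
        """Publish a triple and notify subscribers (near-time notification)."""
        triple = (subject, predicate, obj)
        self.triples.add(triple)
        for callback in self.subscribers:
            callback(triple)

    def read(self, subject=None, predicate=None, obj=None):
        """Retrieve all triples matching a pattern; None acts as a wildcard."""
        pattern = (subject, predicate, obj)
        return [t for t in self.triples
                if all(p is None or p == v for p, v in zip(pattern, t))]

    def subscribe(self, callback):
        self.subscribers.append(callback)

space = TripleSpace()
space.subscribe(lambda t: print("changed:", t))
space.write("ParkingLot42", "hasFreeSpaces", "17")
print(space.read(predicate="hasFreeSpaces"))
```

Embedding reasoning into such a space, as discussed below, would mean that a read operation may return not only the triples that were written but also triples that can be inferred from them.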
A first European project on semantic spaces was initiated in April 2006. The project TripCom provides the foundational conceptual work for semantic spaces, and a prototypical implementation is expected in 2008. However, as the first project in this field, TripCom uses primarily RDF as a representation language for space data, and exploits reasoning on semantic data only to a limited extent. The latter is also due to the lack of appropriate instruments to combine reasoning with the coordination model underlying such systems.

A first step in advancing the state of the art will thus have to aim at taking up these ideas towards fully Semantic Web-enabled spaces, and at designing and implementing the extensions or changes needed to feasibly support more expressive Semantic Web formalisms and to embed reasoning into distributed spaces. Further on, once a distributed space middleware is available, the semantics of the underlying coordination model (e.g. the meaning of the operations for publishing and retrieving data) will need to be rethought in order to adapt to the particularities of an open, globally networked environment such as the Web, in which completeness and correctness cannot be guaranteed. In this respect, results of the LarKC project could contribute to the further development of space-based technology.
4. Use Cases
4.1. Real Time City
One of the major technical problems that hinder the development of real-time control for cities is the difficulty of answering massive numbers of queries. These queries require reasoning about huge streams of facts, originating in real time from heterogeneous and noisy data sources, in order to control traffic flow and help citizens get where they need to be. For instance, 30-40% of car fuel in large cities is spent by drivers looking for a parking lot, and this share increases dramatically when big events involving lots of people take place. Imagine if it were possible to predict potential congestion problems and help people avoid the congestion, ensuring that they reach the event in time.

A large amount of information is already available at almost no cost: all the commercial activities and meeting places (cf. Google Earth), all events scheduled in the city and their locations, all the position and speed information from public transportation vehicles and mobile phone users, and the availability of free parking lots in specific parking areas and their positions. For a large city it is not hard to imagine that there could be billions of triples that are continuously being updated.

A new generation of reasoner is clearly needed in order to infer, for each citizen who wants to attend an event, the most convenient parking area, a place to meet with friends, and the fastest route to that place with which to instruct the car GPS. The time constraints for such a reasoner are very demanding (i.e., a few milliseconds per query), taking into consideration that citizens are free to follow or discard the suggestions proposed by the city's real-time control system, so continuous inference is required.
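As a toy illustration of continuous inference over such a stream of city data, the sketch below keeps a running view of parking availability and re-answers a standing query as each new triple arrives. The predicate name and the threshold are invented for illustration and are not part of any LarKC use-case implementation.

```python
def continuous_parking_query(triple_stream, min_free=5):
    """Maintain free-space counts per parking area and yield, after every
    update, the areas that currently satisfy the standing query."""
    free_spaces = {}
    for subject, predicate, obj in triple_stream:
        if predicate == "hasFreeSpaces":
            free_spaces[subject] = int(obj)
        yield {area for area, n in free_spaces.items() if n >= min_free}

stream = [("LotA", "hasFreeSpaces", "12"),
          ("LotB", "hasFreeSpaces", "3"),
          ("LotA", "hasFreeSpaces", "2")]   # LotA fills up during the event
for answer in continuous_parking_query(stream):
    print(answer)   # {'LotA'}, {'LotA'}, set()
```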
4.2. Data Integration for Early Clinical Drug Development
The number of different databases used by life-science researchers, as well as the diversity of their schemata and intrinsic semantics, makes semantic data integration extremely relevant to this domain.

Data integration is one of the most challenging, expensive and pressing IT problems tackled by pharmaceutical companies in their drug development process. In recent years, ever more pharmaceutical data has been generated as a result of the R&D process. With high-throughput technologies it is possible to analyze thousands of genes and gene products in a very short time. Despite the high volume of newly discovered biological data, pharmaceutical companies have so far not managed to integrate this patient, disease, efficacy and safety data in a way that meets their expectations of increased productivity and pipeline throughput. In 2005, there were 719 publicly available databases, 171 more than in 2004. These exhibit extreme heterogeneity of information type, and there is no single database that supplies a complete picture for decision making in drug discovery and development. Most of the databases are developed independently and address local and specific needs.

The volume of the data, the degree of interconnection and the complexity of the reasoning necessary for the integration put the problem far beyond the capabilities of contemporary inference engines. There are very few engines that can load the OWL variant of the UNIPROT database and reason on top of it. Even the engines that can do so suffer from performance that is impractical for most applications. Integration with other databases or with experimental results is simply unthinkable at present. We aim for semantic integration of life-science databases into an ontology-structured knowledge base, and for annotation of scientific papers, both post hoc and during authoring, with respect to the knowledge base. These scenarios will pose heavy loads on LarKC components and thus test their performance in practice.
5. Conclusion
With the rapid growth of available data and knowledge in standardised Semantic Web formats, there is a pressing need for a reasoning platform that can handle very large volumes of such data.

Furthermore, the Web is an architecture of standards and formalisms for many heterogeneous applications, fulfilling very different demands. While these multiple facets are among the very building blocks of the Web, they also imply that basic assumptions of reliability, trustworthiness, and consistency do not necessarily hold. Moreover, as the Web is a democratic and open medium in which "publishing is cheap" [1], we should even expect inconsistency (due to malicious content or simply disagreement about a state of things).

LarKC aims to be the platform to address these issues, and is built on the following principles:

- Achieve scalability through parallelisation. Different possibilities are offered, either through tight integration of parallel processes on cluster-style hardware, or through much more loosely coupled wide-area distributed computing.
- Achieve scalability through giving up completeness. Partial reasoning results are useful in many domains of application. Significant speedups can be obtained by allowing incompleteness in many stages of the reasoning process, ranging from the selection of the axioms to incomplete reasoning over those axioms.
- Do not build a single reasoning engine that is supposed to be suited for all kinds of use cases, but instead build a configurable platform on which different components can be plugged in to obtain different scale/efficiency trade-offs, as required by different use cases.
References
[1] T. Berners-Lee, W. Hall, J. Hendler, K. O'Hara, N. Shadbolt, and D. Weitzner. A framework for Web science. Foundations and Trends in Web Science, 1(1):1-130, 2006.
[2] R. Brachman and H. Levesque. The tractability of subsumption in frame-based description languages. Proc. of the 4th Nat. Conf. on Artificial Intelligence (AAAI-84), pages 34-37, 1984.
[3] F. Donini, M. Lenzerini, D. Nardi, and A. Schaerf. AL-log: Integrating Datalog and Description Logics. Journal of Intelligent Information Systems, 10(3):227-252, 1998.
[4] D. Fensel. Triple-space computing: Semantic Web Services based on persistent publication of information. Proceedings of the IFIP International Conference on Intelligence in Communication Systems, 2004.
[5] B. Grosof, I. Horrocks, and R. Volz. Description logic programs: combining logic programs with description logic. Proceedings of the 12th International Conference on World Wide Web, pages 48-57, 2003.
[6] A. Harth, J. Umbrich, A. Hogan, and S. Decker. YARS2: A federated repository for querying graph structured data from the Web. In Proceedings of the 6th International Semantic Web Conference and 2nd Asian Semantic Web Conference (ISWC/ASWC 2007), Busan, South Korea, volume 4825 of LNCS, pages 211-224. Springer Verlag, November 2007.
[7] A. Kiryakov, D. Ognyanov, and D. Manov. OWLIM - a Pragmatic Semantic Repository for OWL. WISE Workshops, pages 182-192, 2005.
[8] D. Lenat. CYC: a large-scale investment in knowledge infrastructure. Communications of the ACM, 38(11):33-38, 1995.
[9] M. Luther, S. Bohm, M. Wagner, and J. Koolwaaij. Enhanced presence tracking for mobile applications. Proc. of ISWC, 5, 2005.
[10] D. Ognyanoff, A. Kiryakov, R. Velkov, and M. Yankova. D2.6.3: A scalable repository for massive semantic annotation. Technical report, SEKT, 2007.
[11] H. Simon. A behavioral model of rational choice. Quarterly Journal of Economics, 69(1):99-118, 1955.