## No full-text available

To read the full-text of this research, you can request a copy directly from the authors.

Handling uncertain knowledge is crucial for modeling many real-world domains. Ontologies and ontology-based data access (OBDA) have proven to be versatile methods for capturing this knowledge. Multiple systems for OBDA have been developed, and there is theoretical work towards probabilistic OBDA, namely on identifying efficiently processable queries. These queries are called safe queries. However, there has been no analysis of the safety of probabilistic queries in real-world applications, and no tool support exists for applying the existing formalisms to SPARQL, the standard query language. In this paper we investigate queries collected from several public SPARQL endpoints and determine the distribution of safe and unsafe queries. This analysis shows that many queries in practice are safe, making probabilistic OBDA feasible and practical for fulfilling real-world users' information needs. Furthermore, we design and conduct benchmarks on real-world and generated data sets which show that answering safe queries in the appropriate way scales to large amounts of data.
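The safe/unsafe distinction drawn above can be made concrete: a safe query admits an extensional plan that computes exact probabilities using relational-style operators over independence assumptions. The following minimal sketch (relations, tuples, and probabilities all invented for illustration) evaluates the safe Boolean query "exists x, y: R(x) AND S(x, y)" over a tuple-independent probabilistic database.

```python
# Extensional evaluation of the safe Boolean query
#   Q = exists x, y : R(x) AND S(x, y)
# over a tuple-independent probabilistic database.
# All relation contents and probabilities are made up for illustration.

from collections import defaultdict
from math import prod

R = {"a": 0.9, "b": 0.5}                      # R(x) with tuple probabilities
S = {("a", "1"): 0.8, ("a", "2"): 0.6,        # S(x, y) with tuple probabilities
     ("b", "1"): 0.7}

# Group the S-tuples by their join value x.
s_by_x = defaultdict(list)
for (x, _y), p in S.items():
    s_by_x[x].append(p)

# Safe plan: independent project on y, join on x, independent project on x.
# P(exists y S(x, y)) = 1 - prod(1 - p) over the S-tuples with this x.
per_x = {x: R[x] * (1 - prod(1 - p for p in s_by_x[x]))
         for x in R if x in s_by_x}

# P(Q) = 1 - prod over x of (1 - P(R(x) AND exists y S(x, y)))
p_q = 1 - prod(1 - p for p in per_x.values())
print(round(p_q, 6))
```

An unsafe query (e.g. the classic "R(x), S(x, y), T(y)" pattern) has no such plan, and exact evaluation is #P-hard in general.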


... The RDFization process can now be scaled as it leverages Apache Spark. We have evaluated the new queries locally in a Virtuoso instance in order to gain runtime statistics (including estimates of the number of results, the selectivity of patterns, overall runtimes, etc.), and have updated the statistical analysis of the queries featured by LSQ to include the additional data provided by the new endpoints. Since the initial release, LSQ has been used by a variety of diverse research works on SPARQL [2,3,11,14,15,17,18,21,22,26,30-32,34,35,37,39,41,42,49,58,69,71,74-76,78-80,83,85-87,89-91,94,97-99,102]. To exemplify the value of LSQ, we discuss the various ways in which the dataset has been used in these past years. ...

... The aforementioned analyses by Han et al. [41] and Bonifati et al. [21,22] suggest that well-designed patterns, queries of bounded treewidth, etc., make for promising fragments. In the context of probabilistic Ontology-Based Data Access (OBDA), Schoenfisch and Stuckenschmidt [85] analyse the ratio of safe queries (whose evaluation is tractable in data complexity) versus unsafe queries (whose evaluation is #P-hard); they show that over 97.9% of the LSQ queries are safe and can be efficiently evaluated. Song et al. [87] use LSQ to analyse how nested OPTIONAL clauses affect query response times; they propose a way to approximate solutions for deeply-nested well-designed patterns. ...

We present the Linked SPARQL Queries (LSQ) dataset, which currently describes 43.95 million executions of 11.56 million unique SPARQL queries extracted from the logs of 27 different endpoints. The LSQ dataset provides RDF descriptions of each such query, which are indexed in a public LSQ endpoint, allowing interested parties to find queries with the characteristics they require. We begin by describing the use cases envisaged for the LSQ dataset, which include applications for research on common features of queries, for building custom benchmarks, and for designing user interfaces. We then discuss how LSQ has been used in practice since the release of four initial SPARQL logs in 2015. We discuss the model and vocabulary that we use to represent these queries in RDF. We then provide a brief overview of the 27 endpoints from which we extracted queries in terms of the domain to which they pertain and the data they contain. We provide statistics on the queries included from each log, including the number of query executions, unique queries, as well as distributions of queries for a variety of selected characteristics. We finally discuss how the LSQ dataset is hosted and how it can be accessed and leveraged by interested parties for their use cases.

... This grounding step has received little attention, as its cost is dominated by the cost of constructing the propositional formula in typical PLP benchmarks that operate on biological, social or hyperlink networks, where formulas are complex. However, it has been observed that the grounding step is the bottleneck that often makes it impossible to apply PLP inference in the context of ontology-based data access over probabilistic data (pOBDA) (Schoenfisch and Stuckenschmidt 2017). ...

... As vProbLog aims to improve the reasoning phase, but essentially keeps the knowledge compilation phase unchanged, we distinguish two types of benchmarks: those where the bottleneck is reasoning, i.e., propositional formulas are relatively small, but a large amount of reasoning may be required to identify the formulas, and those where the bottleneck is knowledge compilation, i.e., propositional formulas become complex quickly. As a representative of the former, we use a probabilistic version of LUBM (Guo, Pan, and Heflin 2011), a setting known to be challenging for ProbLog (Schoenfisch and Stuckenschmidt 2017;van Bremen, Dries, and Jung 2019); for the latter type, we use three benchmarks from the ProbLog literature (Fierens et al. 2015;Renkens et al. 2014;Vlasselaer et al. 2016) that essentially are all variations of network connectivity queries: ...

State-of-the-art inference approaches in probabilistic logic programming typically start by computing the relevant ground program with respect to the queries of interest, and then use this program for probabilistic inference using knowledge compilation and weighted model counting. We propose an alternative approach that uses efficient Datalog techniques to integrate knowledge compilation with forward reasoning with a non-ground program. This effectively eliminates the grounding bottleneck that so far has prohibited the application of probabilistic logic programming in query answering scenarios over knowledge graphs, while also providing fast approximations on classical benchmarks in the field.
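The pipeline this abstract describes (grounding, knowledge compilation, weighted model counting) can be illustrated in miniature by brute-force enumeration of possible worlds; real engines such as ProbLog compile the ground formulas into circuits precisely to avoid this exponential enumeration. The toy program and all probabilities below are invented.

```python
# Brute-force possible-world semantics for a toy probabilistic logic program:
#   0.6 :: edge(a, b).   0.7 :: edge(b, c).   0.4 :: edge(a, c).
#   path(X, Y) :- edge(X, Y).
#   path(X, Y) :- edge(X, Z), path(Z, Y).
# Query: path(a, c). The facts and probabilities are illustrative only.

from itertools import product

facts = [(("a", "b"), 0.6), (("b", "c"), 0.7), (("a", "c"), 0.4)]

def reachable(edges, src, dst):
    """Does a directed path from src to dst exist in this world?"""
    seen, stack = set(), [src]
    while stack:
        n = stack.pop()
        if n == dst:
            return True
        if n not in seen:
            seen.add(n)
            stack.extend(t for (s, t) in edges if s == n)
    return False

p_query = 0.0
for world in product([True, False], repeat=len(facts)):
    weight = 1.0
    edges = []
    for present, (edge, p) in zip(world, facts):
        weight *= p if present else 1 - p
        if present:
            edges.append(edge)
    if reachable(edges, "a", "c"):
        p_query += weight
print(round(p_query, 6))
```

With n probabilistic facts this loop visits 2^n worlds, which is exactly why grounding only the relevant program, and compiling its formula rather than enumerating, matters at knowledge-graph scale.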

... This grounding step has received little attention, as its cost is dominated by the cost of constructing the propositional formula in typical PLP benchmarks that operate on biological, social or hyperlink networks, where formulas are complex. However, it has been observed that the grounding step is the bottleneck that often makes it impossible to apply PLP inference in the context of ontology-based data access over probabilistic data (pOBDA) [Schoenfisch and Stuckenschmidt, 2017;van Bremen et al., 2019], where determining the relevant grounding explores a large search space, but only small parts of this space contribute to the formulas. ...


... They have proven that the issues with a high degree of difficulty might be solved by the proposed approach. In 2017, Schoenfisch and Stuckenschmidt [60] investigated queries gathered from different public SPARQL endpoints. Their analysis reviewed various queries used in practice and identified the safe fraction among them. ...

... Fig. 4 ("Analysis on Web application: (a) context data retrieval, (b) context data modelling") tags each surveyed work with its application area: panel modelling [27], shopping application [30], ethnographic analysis [31], ontology-based application [34], transport application [37], alcohol misuse [38], wind energy prediction [39], socio-economic vulnerability [41], alien species invasion [42], human-computer interaction [43], disaster risk [47], population structuring [50], radioactive-waste disposal [52], action recognition [57], mathematical application [58], engineering application [59], and query processing [60]. ...

As the data in various applications grows, it is essential to keep retrieval and modeling simple and effective. A number of modeling approaches already exist for this problem, and context-awareness modeling plays a significant role among them. However, modeling systems still require advancement through the incorporation of new technologies. Hence, this survey reviews context-aware modeling in two aspects: context data retrieval and context data modeling. It analyses the literature on diverse techniques associated with context-awareness modeling, reviewing 60 research papers and presenting a significance analysis. Initially, the analysis depicts the various applications contributed by the different papers. Subsequently, it focuses on features such as web applications, time-series models, intelligence models, and performance measures. Moreover, the survey provides a detailed chronological review and the performance achievements of each contribution. Finally, it discusses open research issues, mainly the adoption of evolutionary algorithms, which can help researchers pursue further work on context-aware systems.

Knowledge graphs (KGs) are used to integrate and persist information useful to organisations, communities, or the general public. It is essential to understand how KGs are used so as to evaluate the strengths and shortcomings of semantic web standards, data modelling choices formalised in ontologies, deployment settings of triple stores, etc. One source of information on the usage of a KG is its query logs, but making sense of hundreds of thousands of log entries is not trivial. Previous works that studied available logs from public SPARQL endpoints mainly focused on the general syntactic properties of the queries, disregarding their semantics and intent. We introduce a novel, content-centric approach that we call query log summarisation, in which we group the queries that can be derived from some common pattern. The type of pattern considered in this work is the query template, i.e. a common blueprint from which multiple queries can be generated by the replacement of parameters with constants. Moreover, we present an algorithm able to summarise a query log as a list of templates whose time and space complexity is linear with respect to the size of the input (number and dimension of queries). We experimented with the algorithm on the query logs of the Linked SPARQL Queries dataset, showing promising results.
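The templating idea can be sketched with a toy normaliser that replaces constants (IRIs, literals, numbers) with numbered placeholders and groups queries by the resulting string. This crude regex-based stand-in is not the paper's algorithm, though it shares the single-pass, linear-time character; the example queries are invented.

```python
# Toy query-log summarisation: map each query to a template by replacing
# constants with placeholders, then count how many queries share a template.
# A regex stand-in for proper SPARQL parsing, for illustration only.

import re
from collections import Counter

CONST = re.compile(r'<[^>]*>|"[^"]*"|\b\d+\b')  # IRIs, string literals, numbers

def template(query: str) -> str:
    n = 0
    def repl(_match):
        nonlocal n
        n += 1
        return f"[C{n}]"                         # numbered constant placeholder
    # Normalise whitespace so trivially different copies collapse too.
    return CONST.sub(repl, " ".join(query.split()))

log = [
    'SELECT ?n WHERE { <http://ex.org/alice> <http://ex.org/name> ?n }',
    'SELECT ?n WHERE { <http://ex.org/bob>   <http://ex.org/name> ?n }',
    'SELECT ?p WHERE { ?s ?p "42" }',
]

summary = Counter(template(q) for q in log)
for tpl, count in summary.most_common():
    print(count, tpl)
```

The first two queries collapse into one template with two parameter slots, so the three-entry log summarises to two templates.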

The process of instantiating, or "grounding", a first-order model is a fundamental component of reasoning in logic. It has been widely studied in the context of theorem proving, database theory, and artificial intelligence. Within the relational learning community, the concept of grounding has been expanded to apply to models that use more general templates in place of first-order logical formulae. In order to perform inference, grounding of these templates is required to instantiate a distribution over possible worlds. However, because of the complex data dependencies stemming from instantiating generalized templates with interconnected data, grounding is often the key computational bottleneck in relational learning. While we motivate our work in the context of relational learning, similar issues arise in probabilistic databases, particularly those that do not make strong tuple-independence assumptions. In this paper, we investigate how key techniques from relational database theory can be utilized to improve the computational efficiency of the grounding process. We introduce the notion of collective grounding, which treats logical programs not as a collection of independent rules, but as a joint set of interdependent workloads that can be shared. We present the theoretical concept of collective grounding, the components necessary in a collective grounding system, and implementations of these components, and show how database theory can be used to speed them up. We demonstrate collective grounding's effectiveness on seven popular datasets, showing up to a 70% reduction in runtime. Our results are fully reproducible, and all code, data, and experimental scripts are included.

The role of uncertainty in data management has become more prominent than ever before, especially because of the growing importance of machine learning-driven applications that produce large uncertain databases. A well-known approach to querying such databases is to blend rule-based reasoning with uncertainty. However, techniques proposed so far struggle with large databases. In this paper, we address this problem by presenting a new technique for probabilistic reasoning that exploits Trigger Graphs (TGs) -- a notion recently introduced for the non-probabilistic setting. The intuition is that TGs can effectively store a probabilistic model by avoiding an explicit materialization of the lineage and by grouping together similar derivations of the same fact. Firstly, we show how TGs can be adapted to support the possible world semantics. Then, we describe techniques for efficiently computing a probabilistic model and formally establish the correctness of our approach. We also present an extensive empirical evaluation using a prototype called LTGs. Our comparison against other leading engines shows that LTGs is not only faster, even against approximate reasoning techniques, but can also reason over probabilistic databases that existing engines cannot scale to.

The Semantic Web represents data in a structured manner, and query-search methods can access these structured data to provide effective search results. Query recommendation over the Semantic Web needs improved relevance with respect to the user's input query. Many existing methods improve query-recommendation efficiency using optimization techniques such as Particle Swarm Optimization (PSO) and Ant Colony Optimization (ACO). These methods involve many features selected from the user query, which in turn increases the cost of a query on the Semantic Web. In this research, query optimization is carried out using a statistical method. The statistics-based optimization method requires fewer features, such as the triple pattern and node priority, to find relevant results. The LUBM dataset contains semantic queries and is used to measure the efficiency of the proposed statistics-based optimization method. SPARQL queries are used to plot the query graph, and triple scores are extracted from the graph. The cost value of the triple scores is measured and given as input to the proposed statistical method. The execution time of the statistics-based optimization method for a query is 35 ms, while the existing method requires 48 ms.
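The abstract leaves the cost model unspecified, so as a purely illustrative sketch, one common statistics-based idea is to order triple patterns by an estimated cost (predicate frequency discounted by the number of bound terms) and evaluate the most selective pattern first. The statistics and the scoring formula below are invented and need not match the paper's actual method.

```python
# Illustrative statistics-based ordering of triple patterns: cheaper
# (more selective) patterns are evaluated first. The per-predicate counts
# and the discount formula are hypothetical.

pred_count = {"type": 100_000, "advisor": 2_000, "name": 90_000}  # fake stats

def cost(pattern):
    s, p, o = pattern
    bound = sum(1 for term in (s, o) if not term.startswith("?"))
    base = pred_count.get(p, 10_000) if not p.startswith("?") else 500_000
    return base / (10 ** bound)   # assume each bound term cuts cost 10x

query = [("?x", "type", "Student"),
         ("?x", "advisor", "?y"),
         ("?y", "name", "?n")]

plan = sorted(query, key=cost)    # evaluation order, cheapest first
for pattern in plan:
    print(pattern)
```

Here the rare `advisor` predicate is evaluated first, so later joins run over a small intermediate result; real engines refine such estimates with per-position histograms.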

This paper presents the application of abstraction methods in creating concepts that make it possible to describe and solve more complex problems of knowledge representation in semantic networks. In the Semantic Web, knowledge is represented by the attributive language AL. Information granules have been formalised in many different theories, such as set theory, probability theory, possible data sets in evidence systems, shadowed sets, fuzzy set theory, and rough set theory. In order to interpret AL language expressions uniformly across different information granule theories within the Semantic Web, it is assumed that AL language expressions are interpreted in a chosen relational system called a granule system. This paper formally describes the information granule system and shows an example of how to formulate such a granule system in the theory of contextual rough sets [14], which describes inclusions of fuzzy context relations with some error. The defined granule system allows AL languages to be interpreted. It is shown that there is a subsystem of this granule system that is adequate, i.e. it is homomorphic within some ordered set algebra.

Representing uncertain information is crucial for modeling real-world domains. In this paper we present a technique for the integration of probabilistic information in Description Logics (DLs) that is based on the distribution semantics for probabilistic logic programs. In the resulting approach, which we call DISPONTE, the axioms of a probabilistic knowledge base (KB) can be annotated with a real number between 0 and 1. A probabilistic knowledge base then defines a probability distribution over regular KBs, called worlds, and the probability of a given query can be obtained from the joint distribution of the worlds and the query by marginalization. We present the algorithm BUNDLE for computing the probability of queries from DISPONTE KBs. The algorithm exploits an underlying DL reasoner, such as Pellet, that is able to return explanations for queries. The explanations are encoded in a Binary Decision Diagram, from which the probability of the query is computed. Experiments with BUNDLE show that it can handle probabilistic KBs of realistic size.
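The final step described above, turning a set of explanations into a query probability, can be sketched via inclusion-exclusion over the explanations; BUNDLE itself encodes the explanations in a BDD, which scales much better when explanations overlap heavily. The axiom names and probabilities below are invented.

```python
# DISPONTE-style query probability from explanations. Each explanation is a
# set of annotated axioms; the query holds in a world iff all axioms of some
# explanation are present. Inclusion-exclusion gives the exact probability
# under axiom independence. Axioms and probabilities are illustrative.

from itertools import combinations
from math import prod

prob = {"ax1": 0.4, "ax2": 0.3, "ax3": 0.6}            # annotated axioms
explanations = [frozenset({"ax1", "ax2"}),             # covers for the query
                frozenset({"ax3"})]

p_query = 0.0
for k in range(1, len(explanations) + 1):
    for subset in combinations(explanations, k):
        union = frozenset().union(*subset)             # axioms needed jointly
        term = prod(prob[a] for a in union)
        p_query += (-1) ** (k + 1) * term              # inclusion-exclusion
print(round(p_query, 6))
```

With many overlapping explanations the number of inclusion-exclusion terms explodes, which is why a compiled representation such as a BDD is used in practice.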

We present LSQ: a Linked Dataset describing SPARQL queries extracted from the logs of public SPARQL endpoints. We argue that LSQ has a variety of uses for the SPARQL research community, be it for example to generate custom benchmarks or conduct analyses of SPARQL adoption. We introduce the LSQ data model used to describe SPARQL query executions as RDF. We then provide details on the four SPARQL endpoint logs that we have RDFised thus far. The resulting dataset contains 73 million triples describing 5.7 million query executions.

FishBase is an important species data collection produced by the FishBase Information and Research Group Inc (FIN), a not-for-profit NGO with the aim of collecting comprehensive information (from the taxonomic to the ecological) about all the world's finned fish species. FishBase is exposed as a MySQL-backed website (supporting a range of canned, albeit complex, queries) and serves over 33 million hits per month. FishDelish is a transformation of FishBase into Linked Data weighing in at 1.38 billion triples. We have ported a substantial number of FishBase SQL queries to FishDelish SPARQL queries, which form the basis of a new linked data application benchmark (using our derivative of the Berlin SPARQL Benchmark harness). We use this benchmarking framework to compare the performance of the native MySQL application, the Virtuoso RDF triple store, and the Quest OBDA system on a fishbase.org-like application.

Recent Semantic Web research has been largely focused on precise ontologies and knowledge representation, leaving only little space for imprecise knowledge such as rules of thumb. The result is a Semantic Web which can answer requests almost perfectly with respect to precision, but provides only a low recall. In this position paper, we envision to address this issue with a stack of Semantic Web technologies that allow imprecise knowledge as an essential ingredient.

Although the majority of evidence analysis in DeepQA is focused on unstructured information (e.g., natural-language documents), several components in the DeepQA system use structured data (e.g., databases, knowledge bases, and ontologies) to generate potential candidate answers or find additional evidence. Structured data analytics are a natural complement to unstructured methods in that they typically cover a narrower range of questions but are more precise within that range. Moreover, structured data that has formal semantics is amenable to logical reasoning techniques that can be used to provide implicit evidence. The DeepQA system does not contain a single monolithic structured data module; instead, it allows for different components to use and integrate structured and semistructured data, with varying degrees of expressivity and formal specificity. This paper is a survey of DeepQA components that use structured data. Areas in which evidence from structured sources has the most impact include typing of answers, application of geospatial and temporal constraints, and the use of formally encoded a priori knowledge of commonly appearing entity types such as countries and U.S. presidents. We present details of appropriate components and demonstrate their end-to-end impact on the IBM Watson™ system.

Most of the work on query evaluation in probabilistic databases has focused on the simple tuple-independent data model, where tuples are independent random events. Several efficient query evaluation techniques exist in this setting, such as safe plans, algorithms based on OBDDs, tree decomposition, and a variety of approximation algorithms. However, complex data analytics tasks often require complex correlations, and query evaluation then becomes significantly more expensive or more restrictive.
In this paper, we propose MVDB as a framework both for representing complex correlations and for efficient query evaluation. An MVDB specifies correlations by views on the probabilistic relations, called MarkoViews, declaring the weights of the views' outputs. An MVDB is a (very large) Markov Logic Network. We make two sets of contributions. First, we show that query evaluation on an MVDB is equivalent to evaluating a Union of Conjunctive Queries (UCQ) over a tuple-independent database. The translation is exact (thus allowing the techniques developed for tuple-independent databases to be carried over to MVDBs), yet it is novel and quite non-obvious (some resulting probabilities may be negative!).
By itself, though, this translation may not lead to much gain, since the translated query becomes complicated as we try to capture more correlations. Our second contribution is a new query evaluation strategy that exploits offline compilation to speed up online query evaluation. Here we utilize and extend our prior work on the compilation of UCQs. We validate our techniques experimentally on a large probabilistic database with MarkoViews inferred from the DBLP data.

This paper presents a system description of Pronto, the first probabilistic Description Logic reasoner capable of processing knowledge bases containing about a thousand probabilistic axioms. We describe the design and architecture of the reasoner with an emphasis on the components that implement the algorithms crucial for achieving this level of scalability. Finally, we present the results of an experimental evaluation of Pronto's performance on a series of propositional and non-propositional probabilistic knowledge bases.

The recently introduced series of description logics under the common moniker 'DL-Lite' has attracted the attention of the description logic and semantic web communities due to the low computational complexity of inference, on the one hand, and the ability to represent conceptual modeling formalisms, on the other. The main aim of this article is to carry out a thorough and systematic investigation of inference in extensions of the original DL-Lite logics along five axes: by (i) adding the Boolean connectives and (ii) number restrictions to concept constructs, (iii) allowing role hierarchies, (iv) allowing role disjointness, symmetry, asymmetry, reflexivity, irreflexivity and transitivity constraints, and (v) adopting or dropping the unique name assumption. We analyze the combined complexity of satisfiability for the resulting logics, as well as the data complexity of instance checking and answering positive existential queries. Our approach is based on embedding DL-Lite logics in suitable fragments of the one-variable first-order logic, which provides useful insights into their properties and, in particular, their computational behavior.

We propose a new family of description logics (DLs), called DL-Lite, specifically tailored to capture basic ontology languages, while keeping low complexity of reasoning. Reasoning here means not only computing subsumption between concepts and checking satisfiability of the whole knowledge base, but also answering complex queries (in particular, unions of conjunctive queries) over the instance level (ABox) of the DL knowledge base. We show that, for the DLs of the DL-Lite family, the usual DL reasoning tasks are polynomial in the size of the TBox, and query answering is LogSpace in the size of the ABox (i.e., in data complexity). To the best of our knowledge, this is the first result of polynomial-time data complexity for query answering over DL knowledge bases. Notably, our logics allow for a separation between TBox and ABox reasoning during query evaluation: the part of the process requiring TBox reasoning is independent of the ABox, and the part of the process requiring access to the ABox can be carried out by an SQL engine, thus taking advantage of the query optimization strategies provided by current database management systems. Since even slight extensions to the logics of the DL-Lite family make query answering at least NLogSpace in data complexity, thus ruling out the possibility of using off-the-shelf relational technology for query processing, we can conclude that the logics of the DL-Lite family are the maximal DLs supporting efficient query answering over large amounts of instances.
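The TBox/ABox separation behind DL-Lite query answering can be sketched for the simplest case, an atomic concept query: the TBox is compiled into a union of queries that a plain SQL engine answers over the ABox tables. The TBox, table names, and schema below are invented, and a full DL-Lite rewriter (e.g. the PerfectRef algorithm) also handles roles and existential restrictions.

```python
# Minimal sketch of TBox-driven query rewriting: an atomic query is expanded
# with every concept the TBox entails to be subsumed by it, then answered as
# an SQL UNION over the ABox tables. TBox and naming are illustrative.

tbox = [("Professor", "Teacher"),   # Professor is subsumed by Teacher
        ("Lecturer", "Teacher"),
        ("Teacher", "Person")]

def rewrite(concept):
    """All concepts whose instances the TBox entails to be `concept`s."""
    result, frontier = {concept}, [concept]
    while frontier:
        c = frontier.pop()
        for sub, sup in tbox:
            if sup == c and sub not in result:
                result.add(sub)
                frontier.append(sub)
    return sorted(result)

def to_sql(concept):
    # The rewriting is ABox-independent; only this final UNION touches data.
    return " UNION ".join(f"SELECT id FROM {c}" for c in rewrite(concept))

print(to_sql("Person"))
```

The key property illustrated here is that the rewriting depends only on the TBox, so the (potentially huge) ABox is touched only by the resulting SQL query.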

We consider here the problem of building a never-ending language learner; that is, an intelligent computer agent that runs forever and that each day must (1) extract, or read, information from the web to populate a growing structured knowledge base, and (2) learn to perform this task better than on the previous day. In particular, we propose an approach and a set of design principles for such an agent, describe a partial implementation of such a system that has already learned to extract a knowledge base containing over 242,000 beliefs with an estimated precision of 74% after running for 67 days, and discuss lessons learned from this preliminary attempt to build a never-ending learning agent.

The USEWOD2011 workshop investigates combinations of usage data with semantics and the web of data. The analysis of usage data may be enhanced using semantic information. Now that more and more explicit knowledge is represented on the Web, the question arises how these semantics can be used to aid large scale web usage analysis and mining. Conversely, usage data analysis can enhance semantic resources as well as Semantic Web applications. Traces of users can be used to evaluate, adapt or personalize Semantic Web applications. Also, new ways of accessing information enabled by the Web of Data imply the need to develop or adapt algorithms, methods, and techniques to analyze and interpret the usage of Web data instead of Web pages. The USEWOD2011 program includes a challenge to the workshop participants: three months before the workshop two datasets consisting of server log files of Linked Open Data sources were released. Participants are invited to come up with interesting analyses, applications, alignments, etc. for these datasets.

Log-linear description logics are a family of probabilistic logics integrating various concepts and methods from the areas of knowledge representation and reasoning and statistical relational AI. We define the syntax and semantics of log-linear description logics, describe a convenient representation as sets of first-order formulas, and discuss computational and algorithmic aspects of probabilistic queries in the language. The paper concludes with an experimental evaluation of an implementation of a log-linear DL reasoner.
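The log-linear semantics sketched above can be shown in miniature: worlds are subsets of the weighted (uncertain) axioms, inconsistent worlds are discarded, and probabilities are obtained by normalising exponentiated weight sums. The axioms, weights, and the single hard consistency constraint below are all invented.

```python
# A tiny log-linear model: P(world) is proportional to exp(sum of weights
# of the uncertain axioms the world includes), restricted to consistent
# worlds. Axioms, weights, and the forbidden combination are illustrative.

from itertools import product
from math import exp

weighted = {"A": 1.5, "B": 0.5}        # uncertain axioms with weights
forbidden = {frozenset({"A", "B"})}    # worlds ruled out as inconsistent

worlds, scores = [], []
for bits in product([True, False], repeat=len(weighted)):
    world = frozenset(a for a, b in zip(sorted(weighted), bits) if b)
    if world in forbidden:
        continue                       # hard constraints get no mass
    worlds.append(world)
    scores.append(exp(sum(weighted[a] for a in world)))

z = sum(scores)                        # partition function
p = {w: s / z for w, s in zip(worlds, scores)}

# Probability that axiom A holds: marginalise over the consistent worlds.
p_a = sum(pr for w, pr in p.items() if "A" in w)
print(round(p_a, 4))
```

Real log-linear DL reasoners avoid this explicit enumeration, but the normalisation-over-consistent-worlds semantics is the same.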

Several real-world applications need to effectively manage and reason about large amounts of data that are inherently uncertain. For instance, pervasive computing applications must constantly reason about volumes of noisy sensory readings for a variety of reasons, including motion prediction and human behavior modeling. Such probabilistic data analyses require sophisticated machine-learning tools that can effectively model the complex spatio-temporal correlation patterns present in uncertain sensory data. Unfortunately, to date, most existing approaches to probabilistic database systems have relied on somewhat simplistic models of uncertainty that can be easily mapped onto existing relational architectures: probabilistic information is typically associated with individual data tuples, with only limited or no support for effectively capturing and reasoning about complex data correlations. In this paper, we introduce BayesStore, a novel probabilistic data management architecture built on the principle of handling statistical models and probabilistic inference tools as first-class citizens of the database system. Adopting a machine-learning view, BayesStore employs concise statistical relational models to effectively encode the correlation patterns between uncertain data, and promotes probabilistic inference and statistical model manipulation as part of the standard DBMS operator repertoire to support efficient and sound query processing. We present BayesStore's uncertainty model based on a novel, first-order statistical model, and we redefine traditional query processing operators to manipulate the data and the probabilistic models of the database in an efficient manner. Finally, we validate our approach by demonstrating the value of exploiting data correlations during query processing, and by evaluating a number of optimizations which significantly accelerate query processing.

The SPARQL Query Language for RDF and the SPARQL Protocol for RDF are implemented by a growing number of storage systems and are used within enterprise and open Web settings. As SPARQL is taken up by the community, there is a growing need for benchmarks to compare the performance of storage systems that expose SPARQL endpoints via the SPARQL protocol. Such systems include native RDF stores as well as systems that rewrite SPARQL queries to SQL queries against non-RDF relational databases. This article introduces the Berlin SPARQL Benchmark (BSBM) for comparing the performance of native RDF stores with the performance of SPARQL-to-SQL rewriters across architectures. The benchmark is built around an e-commerce use case in which a set of products is offered by different vendors and consumers have posted reviews about products. The benchmark query mix emulates the search and navigation pattern of a consumer looking for a product. The article discusses the design of the BSBM benchmark and presents the results of a benchmark experiment comparing the performance of four popular RDF stores (Sesame, Virtuoso, Jena TDB, and Jena SDB) with the performance of two SPARQL-to-SQL rewriters (D2R Server and Virtuoso RDF Views) as well as the performance of two relational database management systems (MySQL and Virtuoso RDBMS).

In this paper, we are interested in an extension capable of handling probabilistic knowledge.

We present Ontop, an open-source Ontology-Based Data Access (OBDA) system that allows for querying relational data sources through a conceptual representation of the domain of interest, provided in terms of an ontology, to which the data sources are mapped. Key features of Ontop are its solid theoretical foundations, a virtual approach to OBDA, which avoids materializing triples and is implemented through the query rewriting technique, extensive optimizations exploiting all elements of the OBDA architecture, its compliance to all relevant W3C recommendations (including SPARQL queries, R2RML mappings, and OWL2QL and RDFS ontologies), and its support for all major relational databases.
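The query-rewriting technique behind the virtual approach can be illustrated with a minimal sketch: an atomic class query is expanded into the union of all classes that entail it under the (transitively closed) subclass hierarchy. The `rewrite` helper and the toy ontology below are made up for illustration; Ontop's real rewriter handles full OWL2 QL and produces SQL over the mapped relational sources.

```python
# Minimal sketch of subclass-based query rewriting (hypothetical
# helper and toy ontology, not Ontop's actual implementation).

def rewrite(concept, subclass_axioms):
    """Return all classes whose instances answer the query for
    `concept`, by closing the subclass hierarchy transitively."""
    result = {concept}
    changed = True
    while changed:
        changed = False
        for sub, sup in subclass_axioms:
            if sup in result and sub not in result:
                result.add(sub)
                changed = True
    return result

# Toy hierarchy: Professor is a Faculty, Faculty is a Person.
axioms = [("Professor", "Faculty"), ("Faculty", "Person")]
print(sorted(rewrite("Person", axioms)))  # ['Faculty', 'Person', 'Professor']
```

A query for Person instances would then be evaluated as the union of the three class queries over the mapped data, avoiding any materialization of triples.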

Ontology-based Data Access has been studied intensively as a highly relevant problem in connection with Semantic Web data. Often it is assumed that the accessed data behaves like a classical database, i.e., it is known which facts hold for certain. Many Web applications, especially those involving information extraction from text, have to deal with uncertainty about the truth of information. In this paper, we introduce an implementation and a benchmark of such a system on top of relational databases. Furthermore, we propose a novel benchmark for systems handling large probabilistic ontologies. We describe the benchmark design and show its characteristics based on the evaluation of our implementation.

Over the past decade, the two research areas of probabilistic databases and probabilistic programming have intensively studied the problem of making structured probabilistic inference scalable, but, so far, both areas have developed almost independently of one another. While probabilistic databases have focused on describing tractable query classes based on the structure of query plans and data lineage, probabilistic programming has contributed sophisticated inference techniques based on knowledge compilation and lifted (first-order) inference. Both fields have developed their own variants of both exact and approximate top-k algorithms for query evaluation, and both investigate query optimization techniques known from SQL, Datalog, and Prolog, all of which calls for a more intensive study of the commonalities and integration of the two fields. Moreover, we believe that natural-language processing and information extraction will remain a driving factor, and in fact a longstanding challenge, for developing expressive representation models that can be combined with structured probabilistic inference, also for decades to come.

There are several approaches to implementing query answering over instance data in the presence of an ontology that target conventional relational database systems (SQL databases) as their back-end. They all share the limitation that, for ontologies formulated in OWL2 QL or versions of the description logic DL-Lite that admit both role hierarchies and inverse roles, it seems impossible to avoid an exponential blowup of the query (and sometimes this is even provable). We consider the combined approach and propose to replace its query rewriting part with a filtering technique. This is natural from an implementation perspective and allows us to handle both inverse roles and role hierarchies without an exponential blowup. We also carry out an experimental evaluation that demonstrates the scalability of this approach.

We propose a framework for querying probabilistic instance data in the presence of an OWL2 QL ontology, arguing that the interplay of probabilities and ontologies is fruitful in many applications such as managing data that was extracted from the web. The prime inference problem is computing answer probabilities, and it can be implemented using standard probabilistic database systems. We establish a PTime vs. #P dichotomy for the data complexity of this problem by lifting a corresponding result from probabilistic databases. We also demonstrate that query rewriting (backwards chaining) is an important tool for our framework, show that non-existence of a rewriting into first-order logic implies #P-hardness, and briefly discuss approximation of answer probabilities.

Representing uncertain information is very important for modeling real world domains. Recently, the DISPONTE semantics has been proposed for probabilistic description logics. In DISPONTE, the axioms of a knowledge base can be annotated with a set of variables and a real number between 0 and 1. This real number represents the probability of each version of the axiom in which the specified variables are instantiated. In this paper we present the algorithm BUNDLE for computing the probability of queries from DISPONTE knowledge bases that follow the ALC semantics. BUNDLE exploits an underlying DL reasoner, such as Pellet, that is able to return explanations for queries. The explanations are encoded in a Binary Decision Diagram from which the probability of the query is computed. The experiments performed by applying BUNDLE to probabilistic knowledge bases show that it can handle ontologies of realistic size and is competitive with the system PRONTO for the probabilistic description logic P-SHIQ(D).
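The core computation, the probability of a query given its explanations over independently annotated axioms, can be sketched without BDDs by brute-force world enumeration. The axiom names and probabilities below are hypothetical, and the enumeration is feasible only for tiny inputs; BUNDLE's BDD encoding avoids exactly this blowup.

```python
from itertools import product

def query_probability(explanations, prob):
    """Exact probability that at least one explanation (a set of
    jointly required, independently true axioms) is satisfied.
    BUNDLE does this compactly with BDDs; here we enumerate all
    2^n worlds, which only works for tiny examples."""
    axioms = sorted({a for e in explanations for a in e})
    total = 0.0
    for world in product([True, False], repeat=len(axioms)):
        truth = dict(zip(axioms, world))
        p = 1.0
        for a in axioms:
            p *= prob[a] if truth[a] else 1.0 - prob[a]
        if any(all(truth[a] for a in e) for e in explanations):
            total += p
    return total

# Hypothetical axiom probabilities; two explanations share "a1":
prob = {"a1": 0.9, "a2": 0.5, "a3": 0.4}
print(query_probability([{"a1", "a2"}, {"a1", "a3"}], prob))
# exact value: 0.9 * (1 - (1-0.5)*(1-0.4)) = 0.63
```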

We study the complexity of computing a query on a probabilistic database. We consider unions of conjunctive queries (UCQ), which are equivalent to positive, existential first-order logic sentences, and also to nonrecursive Datalog programs. The tuples in the database are independent random events. We prove the following dichotomy theorem: for every UCQ query, either its probability can be computed in polynomial time in the size of the database, or it is #P-hard. Our result also has applications to the problem of computing the probability of positive Boolean expressions, and establishes a dichotomy for such classes based on their structure. For the tractable case, we give a very simple algorithm that alternates between two steps: applying the inclusion/exclusion formula, and removing one existential variable. A key and novel feature of this algorithm is that it avoids computing terms that cancel out in the inclusion/exclusion formula; in other words, it only computes those terms whose Möbius function in an appropriate lattice is nonzero. We show that this simple feature is a key ingredient needed to ensure completeness. For the hardness proof, we give a reduction from the counting problem for positive, partitioned 2CNF, which is known to be #P-complete. The hardness proof is nontrivial, and combines techniques from logic, classical algebra, and analysis.
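For the tractable (safe) case, answer probabilities can be computed extensionally over the independent tuples. The sketch below, with made-up tuple probabilities, shows the two basic rules (independent join and independent project) on the safe Boolean query exists x, y : R(x) and S(x, y); the full algorithm additionally applies the inclusion/exclusion step over unions, which is omitted here.

```python
# Extensional evaluation for the safe Boolean query
#   Q = exists x, y : R(x) and S(x, y)
# using two rules over independent tuples (a sketch of the
# tractable case only, not the paper's complete algorithm):
#   independent join:    P(A and B) = P(A) * P(B)
#   independent project: P(exists x. A(x)) = 1 - prod_x (1 - P(A(x)))

def ind_project(probs):
    """1 - product of (1 - p) over independent events."""
    p = 1.0
    for q in probs:
        p *= 1.0 - q
    return 1.0 - p

# Made-up tuple probabilities:
R = {"a": 0.5, "b": 0.8}                                  # P(R(x))
S = {("a", "u"): 0.4, ("a", "v"): 0.5, ("b", "u"): 0.25}  # P(S(x, y))

def q_probability():
    per_x = []
    for x, p_r in R.items():
        p_s = ind_project([p for (x2, _), p in S.items() if x2 == x])
        per_x.append(p_r * p_s)       # independent join on x
    return ind_project(per_x)         # independent project over x

print(q_probability())  # exact value: 1 - (1 - 0.5*0.7) * (1 - 0.8*0.25) = 0.48
```

The plan runs in time polynomial in the number of tuples; for unsafe queries no such plan exists and evaluation is #P-hard.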

We present statistics on real world SPARQL queries that may be of interest for building SPARQL query processing engines and benchmarks. In particular, we analyze the syntactical structure of queries in a log of about 3 million queries, harvested from the DBPedia SPARQL endpoint. Although a sizable portion of the log is shown to consist of so-called conjunctive SPARQL queries, non-conjunctive queries that use SPARQL's union or optional operators are more than substantial. It is known, however, that query evaluation quickly becomes hard for queries including the non-conjunctive operators union or optional. We therefore drill deeper into the syntactical structure of the queries that are not conjunctive and show that in 50% of the cases, these queries satisfy certain structural restrictions that imply tractable evaluation in theory. We hope that the identification of these restrictions can aid in the future development of practical heuristics for processing non-conjunctive SPARQL queries.

We describe our method for benchmarking Semantic Web knowledge base systems with respect to use in large OWL applications. We present the Lehigh University Benchmark (LUBM) as an example of how to design such benchmarks. The LUBM features an ontology for the university domain, synthetic OWL data scalable to an arbitrary size, 14 extensional queries representing a variety of properties, and several performance metrics. The LUBM can be used to evaluate systems with different reasoning capabilities and storage mechanisms. We demonstrate this with an evaluation of two memory-based systems and two systems with persistent storage.

Ontologies play a crucial role in the development of the Semantic Web as a means for defining shared terms in web resources. They are formulated in web ontology languages, which are based on expressive description logics. Significant research efforts in the Semantic Web community have recently been directed towards representing and reasoning with uncertainty and vagueness in ontologies for the Semantic Web. In this paper, we give an overview of approaches in this context to managing probabilistic uncertainty, possibilistic uncertainty, and vagueness in expressive description logics for the Semantic Web.

Inspired by game theory representations, Bayesian networks, influence diagrams, structured Markov decision process models, logic programming, and work in dynamical systems, the independent choice logic (ICL) is a semantic framework that allows for independent choices (made by various agents, including nature) and a logic program that gives the consequence of choices. This representation can be used as a specification for agents that act in a world, make observations of that world and have memory, as well as a modelling tool for dynamic environments with uncertainty. The rules specify the consequences of an action, what can be sensed and the utility of outcomes. This paper presents a possible-worlds semantics for ICL, and shows how to embed influence diagrams, structured Markov decision processes, and both the strategic (normal) form and extensive (game-tree) form of games within the ICL. It is argued that the ICL provides a natural and concise representation for multi-agent decision-making under uncertainty that allows for the representation of structured probability tables, the dynamic construction of networks (through the use of logical variables) and a way to handle uncertainty and decisions in a logical representation.

This paper describes the first steps towards developing a methodology for testing and evaluating the performance of reasoners for the probabilistic description logic P-SHIQ(D). Since it is a new formalism for handling uncertainty in DL ontologies, no such methodology has been proposed. There are no sufficiently large probabilistic ontologies to be used as test suites. In addition, since the reasoning services in P-SHIQ(D) are mostly query oriented, there is no single problem (like classification or realization in classical DL) that could be an obvious candidate for benchmarking. All these issues make it hard to evaluate the performance of reasoners, reveal the complexity bottlenecks and assess the value of optimization strategies. This paper addresses these important problems by making the following contributions: First, it describes a probabilistic ontology that has been developed for the real-life domain of breast cancer and poses significant challenges for state-of-the-art P-SHIQ(D) reasoners. Second, it explains a systematic approach to generating a series of probabilistic reasoning problems that enable evaluation of the reasoning performance and shed light on what makes reasoning in P-SHIQ(D) hard in practice. Finally, the paper presents an optimized algorithm for non-monotonic entailment. Its positive impact on performance is demonstrated using our evaluation methodology.

We introduce ProbLog, a probabilistic extension of Prolog. A ProbLog program defines a distribution over logic programs by specifying for each clause the probability that it belongs to a randomly sampled program, and these probabilities are mutually independent. The semantics of ProbLog is then defined by the success probability of a query, which corresponds to the probability that the query succeeds in a randomly sampled program. The key contribution of this paper is the introduction of an effective solver for computing success probabilities. It essentially combines SLD-resolution with methods for computing the probability of Boolean formulae. Our implementation further employs an approximation algorithm that combines iterative deepening with binary decision diagrams. We report on experiments in the context of discovering links in real biological networks, a demonstration of the practical usefulness of the approach.
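The success probability can also be approximated by sampling: keep each probabilistic fact independently with its probability, then test whether the query succeeds in the sampled subprogram. The sketch below does this for a hypothetical probabilistic graph with a reachability query (echoing the link-discovery setting); ProbLog's actual solver instead combines SLD-resolution with BDDs.

```python
import random

# Hypothetical probabilistic edges (made up for illustration):
edges = {("a", "b"): 0.8, ("b", "c"): 0.7, ("a", "c"): 0.3}

def path_exists(kept, src, dst):
    """Depth-first search over a sampled edge set."""
    seen, stack = set(), [src]
    while stack:
        n = stack.pop()
        if n == dst:
            return True
        if n not in seen:
            seen.add(n)
            stack.extend(y for (x, y) in kept if x == n)
    return False

def success_probability(query, samples=100_000, seed=0):
    """Monte Carlo estimate: fraction of sampled subprograms
    in which the query (here: reachability) succeeds."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        kept = [e for e, p in edges.items() if rng.random() < p]
        hits += path_exists(kept, *query)
    return hits / samples

print(success_probability(("a", "c")))
# estimates the exact value 1 - (1-0.3)*(1-0.8*0.7) = 0.692
```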

We propose a simple approach to combining first-order logic and probabilistic graphical models in a single representation. A Markov logic network (MLN) is a first-order knowledge base with a weight attached to each formula (or clause). Together with a set of constants representing objects in the domain, it specifies a ground Markov network containing one feature for each possible grounding of a first-order formula in the KB, with the corresponding weight. Inference in MLNs is performed by MCMC over the minimal subset of the ground network required for answering the query. Weights are efficiently learned from relational databases by iteratively optimizing a pseudo-likelihood measure. Optionally, additional clauses are learned using inductive logic programming techniques. Experiments with a real-world database and knowledge base in a university domain illustrate the promise of this approach.
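The distribution an MLN assigns to worlds, with probability proportional to the exponentiated sum of weights of satisfied ground formulas, can be sketched for a hypothetical one-constant knowledge base (the formulas and weights below are made up, and the brute-force normalization stands in for the MCMC inference used in practice):

```python
from itertools import product
from math import exp

# Hypothetical one-constant MLN:
#   weight 1.5:  Smokes(anna)
#   weight 2.0:  Smokes(anna) => Cancer(anna)
# P(world) is proportional to exp(sum of weights of satisfied
# ground formulas), normalized by the partition function Z.

def world_weight(smokes, cancer):
    w = 0.0
    if smokes:
        w += 1.5                      # Smokes(anna) holds
    if (not smokes) or cancer:
        w += 2.0                      # the implication holds
    return exp(w)

worlds = list(product([False, True], repeat=2))
Z = sum(world_weight(s, c) for s, c in worlds)  # partition function
p_cancer = sum(world_weight(s, c) for s, c in worlds if c) / Z
print(p_cancer)  # roughly 0.773
```

With more constants the ground network grows combinatorially, which is why inference in MLNs is restricted to the minimal ground subnetwork relevant to the query.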

When a joint distribution P_F is given to a set F of facts in a logic program DB = F ∪ R, where R is a set of rules, we can further extend it to a joint distribution P_DB over the set of possible least models of DB. We then define the semantics of DB with the associated distribution P_F as P_DB, and call it the distribution semantics.
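A minimal sketch of this construction: enumerate the subsets of F, weight each subset by P_F, compute the least model of the chosen facts together with R by forward chaining, and sum the weights of the subsets whose least model contains the query atom. The facts and rules below are a made-up toy program.

```python
from itertools import product

# Toy probabilistic facts F with P_F, and rules R:
#   wet :- rain.     wet :- sprinkler.
facts = {"rain": 0.3, "sprinkler": 0.5}
rules = [("wet", ["rain"]), ("wet", ["sprinkler"])]

def least_model(true_facts):
    """Forward chaining to the least model of the chosen facts + R."""
    model = set(true_facts)
    changed = True
    while changed:
        changed = False
        for head, body in rules:
            if head not in model and all(b in model for b in body):
                model.add(head)
                changed = True
    return model

def prob(atom):
    """P_DB(atom): total P_F-weight of the fact subsets whose
    least model contains the atom."""
    names = sorted(facts)
    total = 0.0
    for world in product([True, False], repeat=len(names)):
        p, chosen = 1.0, []
        for name, t in zip(names, world):
            p *= facts[name] if t else 1.0 - facts[name]
            if t:
                chosen.append(name)
        if atom in least_model(chosen):
            total += p
    return total

print(prob("wet"))  # exact value: 1 - (1-0.3)*(1-0.5) = 0.65
```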

An empirical study of real-world SPARQL queries - Gallego

Analyzing real-world SPARQL queries in the light of probabilistic data - Schoenfisch