Article

N-Quads: Extending N-Triples with Context

Authors: Richard Cyganiak, Andreas Harth, Aidan Hogan
... This type of query can be achieved with either FROM NAMED or GRAPH clauses. There are two widely accepted formats to represent named-graphs in RDF: N-Quads [6] and TriG [2]. Most quad stores are compatible with these two formats. ...
... The average number of hits per month grew from 3M hits in 2010 to 122M hits in 2011. Our commitment to serve this growing demand is one of the primary motivations to undertake the project of migrating towards a more scalable infrastructure. ...
Article
Full-text available
Quad stores have a number of features that make them very attractive for Semantic Web applications. Quad stores use Named Graphs to give contextual information to RDF graphs. Developers can use these contexts to incorporate meta-data such as provenance and versioning. The OWL language specification provides the owl:imports construct as one of the key ways to reuse the ontologies on the Semantic Web. When one ontology imports another through the owl:imports statement, all axioms from the imported ontology are brought to the source ontology. When implementing an online ontology repository, we have explored a number of different approaches to reflect these imports and contextual information in a quad store. These approaches have a direct impact on storage requirements, query articulation, and reusability. This paper describes different models to represent contextual information, ontology imports and versioning. We have performed the experiments in the context of storing ontologies from the BioPortal ontology repository. This repository has 304 ontologies, with multiple versions for many of them. We extract metrics on the levels of imports, and the storage and performance requirements and discuss the trade-offs among query articulation, reusability and scalability.
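As a rough illustration of the named-graph querying mentioned in the excerpt above (FROM NAMED / GRAPH clauses over N-Quads data), the following Python sketch uses the rdflib library; the file name and graph IRI are hypothetical placeholders, not taken from the cited work.

```python
from rdflib import Dataset

ds = Dataset()
# In N-Quads every line is "<s> <p> <o> <graph> ." -- the fourth term names the graph.
ds.parse("ontologies.nq", format="nquads")   # hypothetical file

# Restrict the basic graph pattern to one named graph, mirroring FROM NAMED / GRAPH.
query = """
SELECT ?s ?p ?o
WHERE {
  GRAPH <http://example.org/graphs/ontology-v1> { ?s ?p ?o }
}
"""
for row in ds.query(query):
    print(row.s, row.p, row.o)
```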
... I propose that the documents be aggregated as-is in a data lake, or rather a document lake for the use I propose. The data themselves are stored in a triple store formatted in RDF, operated through SPARQL, structuring the data in the form of NQUADS [CHH08]. Using NQUADS rather than classic triples makes it possible to put in place a mechanism that avoids duplicating data. ...
Thesis
Heritage work takes many forms: studies supporting a nomination, museum exhibitions, ethnological, archaeological or historiographical analysis. The nature of the activities depends on the type of heritage considered, the intentions, the documentation available, and so on. Most often, complementary studies accumulate, making it possible to combine viewpoints, hypotheses and information. The digital humanities, that is, scholarly activities in the humanities in the broad sense that rely on computation, have been developing since the 1960s. Heritage work is no exception, but the quantity and diversity, indeed the heterogeneity, of the information, combined with the ethical requirements of heritage work, complicate the development of relevant tooling. A first part presents a reflection on the characteristics of heritage work and on the challenges of modelling in close connection with documentation. From this analysis, a set of requirements for producing a tool that addresses the priority challenges is established. The opposition between the construction of meaning that drives heritagisation and the rupture of meaning inherent in the digital is discussed, as is the need for transparency in modelling practices. The criteria of integrity and authenticity of heritage assets, which partly guide our contributions, are also affirmed in their dynamic dimensions. The application to the case study of the Pic du Midi Observatory and to the series of Gautier meridian circles demonstrates the proposals and tests their relevance and limits.
... They are commonly referred to as entity-attribute-value models or subject-predicate-object models (e.g., semantic triples [24], N-Triples [24]). 3-tables have also been employed in the domain of knowledge representation and are referred to as context-subject-predicate-object models (e.g., N-Quads [12]). Note also that the RL data model is closely related to the nested table data model, a data model proposed as a canonical model for data definition and manipulation of forms and form-based documents [23], as well as other nested relational data models [17]. ...
... This structure can be interpreted as: "object o stands in relationship p with subject s". RDF can be represented in different serialization formats including RDF/XML [16], JSON for Linked Data (JSON-LD) [17], N-Triples [18], N-Quads [19], Turtle [20], RDFa [21], Notation 3 (N3) [22] and Entity Notation (EN) [23] [24]. ...
Conference Paper
Full-text available
Semantics associates meaning with Internet of Things (IoT) data and facilitates the development of intelligent IoT applications and services. However, the big volume of the data generated by IoT devices and resource limitations of these devices have given rise to challenges for applying semantic technologies. In this article, we present Cloud and edge based IoT architectures for semantic reasoning. We report three experiments that demonstrate how edge computing can facilitate IoT systems in terms of data transfer and semantic reasoning. We also analyze how distributing reasoning tasks between the Cloud and edge devices affects system performance.
... Mining frequent structural patterns from a collection of graphs, usually referred to as frequent subgraph mining (FSM), has found much research interest in the last two decades, for example, to identify significant patterns from chemical or biological structures and protein interaction networks [13]. Besides these typical application domains, graph collections are generally a natural representation of partitioned network data such as knowledge graphs [7], business process executions [24] or communities in a social network [14]. We identified two requirements for FSM on such data that are not satisfied by existing approaches: First, such data typically describes directed multigraphs, i.e., the direction of an edge has a semantic meaning and there may exist multiple edges between the same pair of vertices. ...
Article
Full-text available
Transactional frequent subgraph mining identifies frequent subgraphs in a collection of graphs. This research problem has wide applicability and increasingly requires higher scalability over single machine solutions to address the needs of Big Data use cases. We introduce DIMSpan, an advanced approach to frequent subgraph mining that utilizes the features provided by distributed in-memory dataflow systems such as Apache Spark or Apache Flink. It determines the complete set of frequent subgraphs from arbitrary string-labeled directed multigraphs as they occur in social, business and knowledge networks. DIMSpan is optimized for runtime and minimal network traffic while remaining memory-aware. An extensive performance evaluation on large graph collections shows the scalability of DIMSpan and the effectiveness of its pruning and optimization techniques.
... The data can be downloaded from the Web Data Commons website in NQuad format (Cyganiak, Harth, & Hogan, 2008), which means that there is an extra, fourth column added to extend the triple with context. In that particular case, the context represents the URL of the website on which the triple was found, so we could call it the data provenance column. ...
Chapter
Full-text available
It has been almost four years now since the world's leading search engine operators, Bing, Google, Yahoo! and Yandex, decided to start working on an initiative to enrich web pages with structured data, known as schema.org. Since then, many web masters and those responsible for web pages started adapting this technology to enrich websites with semantic information. This paper analyzes parts of the structured data in the largest available open to the public web crawl, the Common Crawl, to find out how the hotel branch is using schema.org. On the use case of schema.org/Hotel, this paper studies who uses it, how it is applied and whether or not the classes and properties of the vocabulary are used in the syntactically and semantically correct way. Further, this paper will compare the usage based on numbers of 2013 and 2014 to find out whether or not an increase in usage can be noted. We observe a wide and growing distribution of schema.org, but also a large variety of erroneous and restricted usage of schema.org within the data set, which makes the data hard to use for real-life applications. When it comes to geographical comparison, the outcome shows that the United States are far in the lead with annotation of hotels with schema.org and Europe still has work to do to catch up.
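The "fourth column" described in the excerpt above can be read off each N-Quads line directly, since the context term is the last term before the closing dot. Below is a naive Python sketch; the file name is hypothetical, and the split relies on the context term being an IRI or blank node (i.e., containing no spaces).

```python
from collections import Counter

pages = Counter()
with open("wdc-hotels.nq", encoding="utf-8") as f:   # hypothetical extract
    for line in f:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        body = line.rstrip(".").rstrip()     # drop the terminating " ."
        context = body.rsplit(" ", 1)[-1]    # fourth column: the source page
        pages[context] += 1

# the most frequently annotated source pages
print(pages.most_common(5))
```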
... If the patch tracks only the triple without the graph reference, it is not invertible, as the graph from which to delete the triple cannot be determined. That means that, to support all possible RDF datasets, a VCS needs to track the corresponding RDF graph of a triple, known as a quad [8]. ...
Conference Paper
Full-text available
Coherent and consistent tracking of provenance data, and in particular of update history information, is a crucial building block for any serious information system architecture. Version Control Systems can be a part of such an architecture, enabling users to query and manipulate versioning information as well as content revisions. In this paper, we introduce an RDF versioning approach as a foundation for a full-featured RDF Version Control System. We argue that such a system needs support for all concepts of the RDF specification, including support for RDF datasets and blank nodes. Furthermore, we place special emphasis on the protection against unperceived history manipulation by hashing the resulting patches. In addition to the conceptual analysis and an RDF vocabulary for representing versioning information, we present a mature implementation which captures versioning information for changes to arbitrary RDF datasets.
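To make the invertibility point from the excerpt above concrete, here is a minimal, hypothetical sketch (not the paper's actual data model) of a patch over quads: because every added or removed statement carries its graph name, swapping the two sets yields the inverse patch.

```python
from dataclasses import dataclass

# A quad is a plain (subject, predicate, object, graph) tuple here.

@dataclass(frozen=True)
class Patch:
    added: frozenset     # quads added by this change
    removed: frozenset   # quads removed by this change

    def apply(self, dataset: set) -> set:
        return (dataset - self.removed) | self.added

    def invert(self) -> "Patch":
        # Well-defined only because every quad carries its graph name.
        return Patch(added=self.removed, removed=self.added)

# Applying a patch and then its inverse restores the original dataset.
d0 = {("ex:s", "ex:p", "ex:o", "ex:graph1")}
p = Patch(added=frozenset({("ex:s", "ex:p", "ex:o2", "ex:graph2")}),
          removed=frozenset(d0))
assert p.invert().apply(p.apply(d0)) == d0
```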
... To ease the manipulation of Graph Stores, we use the N-Quads [12] notation. Instead of a set of sets, there is a set of quads (s, p, o, gn), where s, p, o correspond to the subject, predicate and object of an RDF triple and gn to the graph's name IRI, with the δ symbol denoting the Default Graph. ...
Article
Full-text available
Linked Open Data cloud (LOD) is essentially read-only, restraining the possibility of collaborative knowledge construction. To support collaboration, we need to make the LOD writable. In this paper, we propose a vision for a writable linked data where each LOD participant can define updatable materialized views from data hosted by other participants. Consequently, building a writable LOD can be reduced to the problem of SPARQL self-maintenance of Select-Union recursive materialized views. We propose TM-Graph, an RDF-Graph annotated with elements of a specialized provenance semiring to maintain consistency of these views and we analyze complexity in space and traffic.
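A small sketch of the set-of-quads view described in the excerpt above: each statement is an (s, p, o, gn) tuple, and a sentinel value plays the role of the paper's δ symbol for the default graph. All names below are illustrative.

```python
# The default graph gets a sentinel name, standing in for the paper's δ symbol.
DEFAULT_GRAPH = "δ"

store = {
    ("ex:alice", "foaf:knows", "ex:bob",  DEFAULT_GRAPH),
    ("ex:alice", "foaf:name",  '"Alice"', "ex:graphs/people"),
    ("ex:bob",   "foaf:name",  '"Bob"',   "ex:graphs/people"),
}

def triples_in(quads, graph_name):
    """Project the triples of one named graph (or of the default graph)."""
    return {(s, p, o) for (s, p, o, g) in quads if g == graph_name}

print(triples_in(store, "ex:graphs/people"))
print(triples_in(store, DEFAULT_GRAPH))
```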
... A large variety of graph data models has been proposed in the last three decades [11, 13, 23, 29], but only two have found considerable attention in graph data management and processing: the resource description framework (RDF) [34] and the property graph model (PGM) [49, 50]. In contrast to the PGM, RDF has some support for multiple graphs by the notion of n-quads [21]; its standardized query language SPARQL [28] also allows queries on multiple graphs. However, the RDF data representation by triples is very fine-grained and there is no uniform way to represent richer concepts of the PGM in RDF [57], so that the distinction of relationships, e.g., (vertex1, edge1, vertex2), type labels, e.g., (edge1, type, knows), and properties, e.g., (vertex1, name, Alice), has to be done at the application level. ...
Article
Full-text available
Many Big Data applications in business and science require the management and analysis of huge amounts of graph data. Previous approaches for graph analytics such as graph databases and parallel graph processing systems (e.g., Pregel) either lack sufficient scalability or flexibility and expressiveness. We are therefore developing a new end-to-end approach for graph data management and analysis based on the Hadoop ecosystem, called Gradoop (Graph analytics on Hadoop). Gradoop is designed around the so-called Extended Property Graph Data Model (EPGM) supporting semantically rich, schema-free graph data within many distinct graphs. A set of high-level operators is provided for analyzing both single graphs and collections of graphs. Based on these operators, we propose a domain-specific language to define analytical workflows. The Gradoop graph store is currently utilizing HBase for distributed storage of graph data in Hadoop clusters. An initial version of Gradoop has been used to analyze graph data for business intelligence and social network analysis.
... The document is parsed and returned in a common format to the Spider. We use N-Quads [4] as the inner format. N-Triples is extended with information about the context (the source of the information represented by the triple). ...
Article
Full-text available
In this paper, we compare various approaches to semantic web data crawling. We introduce our crawling framework, which enables us to organize and clean the data before they are presented to the end user or used as a knowledge base. We present methods of semantic data cleaning in order to keep the knowledge base consistent. We used the proposed framework to build a knowledge base containing data about persons crawled from semantic web data sources. In this paper we present the results of the crawling process.
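The "inner format" idea quoted above (triples plus the source of the information as context) can be sketched with rdflib's Dataset, where each crawled document's URL names the graph holding the triples extracted from it. The helper, IRIs and values below are hypothetical, not the framework's actual API.

```python
from rdflib import Dataset, Literal, Namespace, URIRef

FOAF = Namespace("http://xmlns.com/foaf/0.1/")

def record(dataset, triples, source_url):
    """Store extracted triples under a named graph identifying their source document."""
    g = dataset.graph(URIRef(source_url))
    for t in triples:
        g.add(t)

ds = Dataset()
record(ds,
       [(URIRef("http://example.org/people#jan"), FOAF.name, Literal("Jan"))],
       "http://example.org/people.rdf")

# Each serialized quad keeps the source document as its fourth term.
print(ds.serialize(format="nquads"))
```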
... We choose the dataset from the Semantic Web Challenge 2010 (http://challenge.semanticweb.org/), called Billion Triples, which contains ∼3.2 billion statements. The data is collected from Sindice, Swoogle and others, and given in N-Quads [23] format. We select the four hundred largest datasets by grouping the triples according to the host of the fourth element ...
Article
Full-text available
The current Web of Data is producing increasingly large RDF datasets. Massive publication efforts of RDF data driven by initiatives like the Linked Open Data movement, and the need to exchange large datasets has unveiled the drawbacks of traditional RDF representations, inspired and designed by a document-centric and human-readable Web. Among the main problems are high levels of verbosity/redundancy and weak machine-processable capabilities in the description of these datasets. This scenario calls for efficient formats for publication and exchange. This article presents a binary RDF representation addressing these issues. Based on a set of metrics that characterizes the skewed structure of real-world RDF data, we develop a proposal of an RDF representation that modularly partitions and efficiently represents three components of RDF datasets: Header information, a Dictionary, and the actual Triples structure (thus called HDT). Our experimental evaluation shows that datasets in HDT format can be compacted by more than fifteen times as compared to current naive representations, improving both parsing and processing while keeping a consistent publication scheme. Specific compression techniques over HDT further improve these compression rates and prove to outperform existing compression solutions for efficient RDF exchange.
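The grouping step mentioned in the excerpt above (selecting datasets by the host of the fourth, context element) could look roughly like the following rdflib sketch; the input file is a hypothetical small slice, not the full Billion Triples corpus.

```python
from collections import Counter
from urllib.parse import urlparse
from rdflib import Dataset

ds = Dataset()
ds.parse("btc-sample.nq", format="nquads")   # hypothetical small slice

triples_per_host = Counter()
for g in ds.contexts():                      # one context graph per fourth element
    host = urlparse(str(g.identifier)).netloc
    triples_per_host[host] += len(g)

# e.g. keep the largest hosts, analogous to the paper's 400 largest datasets
for host, count in triples_per_host.most_common(10):
    print(host, count)
```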
... kit.edu/projects/btc-2012/rest/). This dataset is encoded in N-Quads format [39] and includes three data files that range in size from 409.99 MB to 2.69 GB. Figure 5 shows the conversion results of the 4.34 GB source dataset in the HPC-JNU cluster system. As an interpreted scripting language, Perl has poor disk I/O performance. ...
Article
Full-text available
Semantic technology plays a key role in various domains, from conversation understanding to algorithm analysis. As the most efficient semantic tool, ontology can represent, process and manage widespread knowledge. Nowadays, many researchers use ontology to collect and organize data's semantic information in order to maximize research productivity. In this paper, we first describe our work on the development of a remote sensing data ontology, with a primary focus on semantic fusion-driven research for big data. Our ontology is made up of 1,264 concepts and 2,030 semantic relationships. However, the growth of big data is straining the capacities of current semantic fusion and reasoning practices. Considering the massive parallelism of DNA strands, we propose a novel DNA-based semantic fusion model. In this model, a parallel strategy is developed to encode the semantic information in DNA for a large volume of remote sensing data. The semantic information is read in a parallel and bit-wise manner and each individual bit is converted to a base. By doing so, a considerable amount of conversion time can be saved, i.e., the cluster-based multi-process program can reduce the conversion time from 81,536 seconds to 4,937 seconds for 4.34 GB of source data files. Moreover, the size of the result file recording DNA sequences is 54.51 GB for the parallel C program, compared with 57.89 GB for the sequential Perl program. This shows that our parallel method can also reduce the DNA synthesis cost. In addition, data types are encoded in our model, which is a basis for building a type system in our future DNA computer. Finally, we describe theoretically an algorithm for DNA-based semantic fusion. This algorithm enables the integration of knowledge from disparate remote sensing data sources into a consistent, accurate, and complete representation. This process depends solely on ligation reactions and screening operations instead of the ontology.
... Among the most popular triple serialization formats only RDF/JSON (informal), RDFa [17] and Notation 3 [18] are fully IRI compatible. N-Triples [19] and N-Quads [20] do not support IRIs at all, since they use 7-bit US-ASCII character encoding. Turtle [21] and RDF/XML [22] provide partial IRI support as their grammar definition is not fully mapped to the IRI grammar [15, page 7]. ...
Article
This paper describes the deployment of the Greek DBpedia and the contribution to the DBpedia information extraction framework with regard to internationalization (I18n) and multilingual support. I18n filters are proposed as pluggable components in order to address issues when extracting knowledge from non-English Wikipedia editions. We report on our strategy for supporting the International Resource Identifier (IRI) and introduce two new extractors to complement the I18n filters. Additionally, the paper discusses the definition of Transparent Content Negotiation (TCN) rules for IRIs to address de-referencing and IRI serialization problems. The aim of this research is to establish best practices (complemented by software) to allow the DBpedia community to easily generate, maintain and properly interlink language-specific DBpedia editions. Furthermore, these best practices can be applied for the publication of Linked Data in non-Latin languages in general.
... Different from this approach, MWeb restricts context logics (there called constituent rule bases) and does not consider converting between constants in different constituent rule bases. Another approach for extending RDF and SPARQL with a notion of context is the N-Quads [6] proposal. For provenance, RDF triples are extended to quadruples containing an additional identifier marking the origin of the RDF triple. ...
Conference Paper
Full-text available
Multi-Context Systems (MCSs) are an expressive framework for interlinking heterogeneous knowledge systems, called contexts. Possible contexts are ontologies, relational databases, logic programs, RDF triplestores, etc. MCSs contain bridge rules to specify knowledge exchange between contexts. We extend the MCS formalism and propose SPARQL-MCS where knowledge exchange is specified in the style of SPARQL CONSTRUCT queries. Different from previous approaches to variables in MCSs, we do not impose any restrictions on contexts. To achieve this, we introduce a general approach for variable substitutions in heterogeneous systems. We define syntax and semantics of SPARQL-MCS and investigate fixpoint evaluation of monotonic MCSs.
... Our definition of RDF streams extends RDF in the same way as the stream type in CQL extends the relation type. Named graphs [22] and N-Quads [23], a format that extends N-Triples with context, can both be adopted as a concrete serialization for RDF streams. For our experiments we adopt N-Quads and use as context the timestamp encoded as an RDF literal of type xsd:dateTime. ...
Conference Paper
Full-text available
Social semantic data are becoming a reality, but apparently their streaming nature has been ignored so far. Streams, being unbounded sequences of time-varying data elements, should not be treated as persistent data to be stored “forever” and queried on demand, but rather as transient data to be consumed on the fly by queries which are registered once and for all and keep analyzing such streams, producing answers triggered by the streaming data and not by explicit invocation. In this paper, we propose an approach to continuous queries and real-time analysis of social semantic data with C-SPARQL, an extension of SPARQL for querying RDF streams.
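The serialization choice described in the excerpt above, N-Quads with the timestamp as context, can be sketched as plain line formatting. Note that the later standardized RDF 1.1 N-Quads grammar restricts the fourth term to an IRI or blank node, so a literal timestamp context follows the earlier, looser proposal that the authors reference; all terms below are illustrative.

```python
from datetime import datetime, timezone

XSD_DATETIME = "http://www.w3.org/2001/XMLSchema#dateTime"

def stream_quad(s, p, o, ts):
    """Format one streamed triple as a quad line whose context is its timestamp."""
    ctx = '"%s"^^<%s>' % (ts.isoformat(), XSD_DATETIME)
    return "<%s> <%s> <%s> %s ." % (s, p, o, ctx)

print(stream_quad("http://example.org/user/alice",
                  "http://xmlns.com/foaf/0.1/knows",
                  "http://example.org/user/bob",
                  datetime(2009, 6, 1, 12, 0, tzinfo=timezone.utc)))
```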
Conference Paper
Transactional frequent subgraph mining identifies frequent structural patterns in a collection of graphs. This research problem has wide applicability and increasingly requires higher scalability over single machine solutions to address the needs of Big Data use cases. We introduce DIMSpan, an advanced approach to frequent subgraph mining that utilizes the features provided by distributed in-memory dataflow systems such as Apache Flink or Apache Spark. It determines the complete set of frequent subgraphs from arbitrary string-labeled directed multigraphs as they occur in social, business and knowledge networks. DIMSpan is optimized for runtime and minimal network traffic while remaining memory-aware. An extensive performance evaluation on large graph collections shows the scalability of DIMSpan and the effectiveness of its optimization techniques.
Thesis
Full-text available
It is recognised that nowadays, users interact with large amounts of data that exist in disparate forms, and are stored under different settings. Moreover, it is true that the amount of structured and unstructured data outside a single well organised data management system is expanding rapidly. To address the recent challenges of managing large amounts of potentially distributed data, the vision of a dataspace was introduced. This data management paradigm aims at reducing the complexity behind the challenges of integrating heterogeneous data sources. Recently, efforts by the Linked Data (LD) community gave rise to a Web of Data (WoD) that interweaves with the current Web of documents in a way that is useful for data consumption by both humans and computational agents. On the WoD, datasets are structured under a common data model and published as Web resources following a simple set of guidelines that enables them to be linked with other pieces of data, as well as to be annotated with useful metadata that help determine their semantics. The WoD is an evolving open ecosystem including specialist publishers as well as community efforts aiming at re-publishing isolated databases as LD on the WoD, and annotating them with metadata. The WoD raises new opportunities and challenges. However, currently it mostly relies on manual efforts for integrating the large amounts of heterogeneous data sources on the WoD. This dissertation makes the case that several techniques from the dataspaces research area (aiming at on-demand integration of data sources in a pay-as-you-go fashion) can support the integration of heterogeneous WoD sources. In so doing, this dissertation explores the opportunities and identifies the challenges of adapting existing pay-as-you-go data integration techniques in the context of LD. More specifically, this dissertation makes the following contributions: (1) a case study for identifying the challenges when existing pay-as-you-go data integration techniques are applied in a setting where data sources are LD; (2) a methodology that deals with the "schema-less" nature of LD sources by automatically inferring a conceptual structure from a given RDF graph, thus enabling downstream tasks, such as the identification of matches and the derivation of mappings, which are both essential for the automatic bootstrapping of a dataspace; and (3) a well-defined, principled methodology that builds on a Bayesian inference technique for reasoning under uncertainty to improve pay-as-you-go integration. Although the developed methodology is generic in being able to reason with different hypotheses, its effectiveness has only been explored on reducing the uncertain decisions made by string-based matchers during the matching stage of a dataspace system.
Article
The proliferation of heterogeneous Linked Data on the Web poses new challenges to database systems. In particular, the capacity to store, track, and query provenance data is becoming a pivotal feature of modern triplestores. We present methods extending a native RDF store to efficiently handle the storage, tracking, and querying of provenance in RDF data. We describe a reliable and understandable specification of the way results were derived from the data and how particular pieces of data were combined to answer a query. Subsequently, we present techniques to tailor queries with provenance data. We empirically evaluate the presented methods and show that the overhead of storing and tracking provenance is acceptable. Finally, we show that tailoring a query with provenance information can also significantly improve the performance of query execution.
Conference Paper
Wikidata is the central data management platform of Wikipedia. By the efforts of thousands of volunteers, the project has produced a large, open knowledge base with many interesting applications. The data is highly interlinked and connected to many other datasets, but it is also very rich, complex, and not available in RDF. To address this issue, we introduce new RDF exports that connect Wikidata to the Linked Data Web. We explain the data model of Wikidata and discuss its encoding in RDF. Moreover, we introduce several partial exports that provide more selective or simplified views on the data. This includes a class hierarchy and several other types of ontological axioms that we extract from the site. All datasets we discuss here are freely available online and updated regularly.
Article
Ranking information resources is a task that usually happens within more complex workflows and that typically occurs in any form of information retrieval, being commonly implemented by Web search engines. By filtering and rating data, ranking strategies guide the navigation of users when exploring large volumes of information items. There exist a considerable number of ranking algorithms that follow different approaches focusing on different aspects of the complex nature of the problem, and reflecting the variety of strategies that are possible to apply. With the growth of the web of linked data, a new problem space for ranking algorithms has emerged, as the nature of the information items to be ranked is very different from the case of Web pages. As a consequence, existing ranking algorithms have been adapted to the case of Linked Data and some specific strategies have started to be proposed and implemented. Researchers and organizations deploying Linked Data solutions thus require an understanding of the applicability, characteristics and state of evaluation of ranking strategies and algorithms as applied to Linked Data. We present a classification that formalizes and contextualizes under a common terminology the problem of ranking Linked Data. In addition, an analysis and contrast of the similarities, differences and applicability of the different approaches is provided. We aim this work to be useful when comparing different approaches to ranking Linked Data and when implementing new algorithms.
Article
Full-text available
In order to attain the coveted information superiority in NATO Network Enabled Capability (NNEC), the challenge of integrating information from different sources in this highly dynamic environment needs to be solved. In this paper, we propose to address this challenge by using a system of lightweight cooperative hybrid agents that rely on Semantic Web technologies. The primary objective when conducting operations according to NATO Network Enabled Capability (NNEC) is to attain information superiority. NNEC is based on an idea of a common information space through which the participating information systems supply information for others to utilize, and retrieve the information needed according to their role (Buckman 2005). In order to realize this idea, the challenge of integrating information from heterogeneous sources in a
Conference Paper
Traditionally, Linked Data query engines execute SPARQL queries over a materialised repository which, on the one hand, guarantees fast query answering but, on the other hand, requires time- and resource-consuming preprocessing steps. In addition, the materialised repositories have to deal with the ongoing challenge of maintaining the index, which is, given the size of the Web, practically infeasible. Thus, the results for a given SPARQL query are potentially outdated. Recent approaches address the result-freshness problem by answering a given query directly over dereferenced query-relevant Web documents. Our work investigates the problem of an efficient selection of query-relevant sources in this context. As a part of query optimization, source selection tries to estimate the minimum number of sources accessed in order to answer a query. We propose to summarize and index sources based on frequently appearing query graph patterns mined from query logs. We verify the applicability of our approach and empirically show that it significantly reduces the number of relevant sources estimated while keeping the overhead low.