
DBkWik: extracting and integrating knowledge from thousands of Wikis


Knowledge and Information Systems (2020) 62:2169–2190
https://doi.org/10.1007/s10115-019-01415-5
Regular Paper
Sven Hertling (corresponding author, sven@informatik.uni-mannheim.de) · Heiko Paulheim (heiko@informatik.uni-mannheim.de)
Data and Web Science Group, University of Mannheim, Mannheim, Germany
Received: 2 January 2019 / Revised: 3 October 2019 / Accepted: 5 October 2019 / Published online: 2 November 2019
© Springer-Verlag London Ltd., part of Springer Nature 2019
Abstract
Popular cross-domain knowledge graphs, such as DBpedia and YAGO, are built from Wikipedia, and therefore similar in coverage. In contrast, Wikifarms like Fandom contain Wikis for specific topics, which are often complementary to the information contained in Wikipedia, and thus DBpedia and YAGO. Extracting these Wikis with the DBpedia extraction framework is possible, but results in many isolated knowledge graphs. In this paper, we show how to create one consolidated knowledge graph, called DBkWik, from thousands of Wikis. We perform entity resolution and schema matching, and show that the resulting large-scale knowledge graph is complementary to DBpedia. Furthermore, we discuss the potential use of DBkWik as a benchmark for knowledge graph matching.
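The consolidation pipeline summarized in the abstract (per-Wiki extraction followed by entity resolution and schema matching) can be illustrated with a minimal sketch. The function names, the label-similarity heuristic, and the plain-tuple triple representation below are assumptions for illustration only, not the authors' actual implementation, which relies on the DBpedia extraction framework and dedicated matching components.

```python
# Illustrative sketch only: per-Wiki extraction output is assumed to be a list of
# (subject, predicate, object) triples; entities from different Wikis are then
# linked by label similarity. Schema matching over classes and properties would
# proceed analogously. In practice, blocking/indexing replaces the naive pairwise loop.
from difflib import SequenceMatcher
from itertools import combinations

def label(entity_uri):
    """Derive a comparable label from an entity URI (assumption: last path segment)."""
    return entity_uri.rsplit("/", 1)[-1].replace("_", " ").lower()

def match_entities(graph_a, graph_b, threshold=0.9):
    """Return candidate owl:sameAs pairs between two extracted graphs."""
    subjects_a = {s for s, _, _ in graph_a}
    subjects_b = {s for s, _, _ in graph_b}
    links = []
    for a in subjects_a:
        for b in subjects_b:
            if SequenceMatcher(None, label(a), label(b)).ratio() >= threshold:
                links.append((a, "owl:sameAs", b))
    return links

def consolidate(graphs):
    """Merge per-Wiki graphs and add cross-graph entity links."""
    merged = [t for g in graphs for t in g]
    for g1, g2 in combinations(graphs, 2):
        merged.extend(match_entities(g1, g2))
    return merged
```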
Keywords: Knowledge graph creation · Information extraction · Linked open data · Knowledge graph matching
1 Introduction
General purpose knowledge graphs, such as DBpedia, YAGO, and Wikidata, have become a central part of the linked open data cloud [49] and are among the most frequently used datasets within the Web of data [8]. Such knowledge graphs contain information on millions of entities from multiple topical domains [37].
Many of the popular knowledge graphs are created from Wikipedia and hence have a similar coverage [47]. Generally speaking, each real-world entity for which a dedicated Wikipedia page exists becomes an entity in the knowledge graph. This is a fundamental restriction for many applications—for example, for building content-based recommender systems backed by knowledge graphs, Di Noia et al. showed that the coverage of entities in popular recommender system datasets in DBpedia is no more than 85% for movies, 63% for music artists, and 31% for books [35].
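The coverage figures cited above boil down to a simple set ratio; a minimal sketch with invented data shows how such a number could be computed.

```python
# Illustrative only: coverage = fraction of items in a recommender dataset
# that can be linked to an entity in the knowledge graph.
def coverage(dataset_items, kg_entities):
    linked = [item for item in dataset_items if item in kg_entities]
    return len(linked) / len(dataset_items)

# Hypothetical example: 31 of 100 books found in the knowledge graph.
books = [f"book_{i}" for i in range(100)]
kg = set(books[:31])
print(f"coverage: {coverage(books, kg):.0%}")   # -> coverage: 31%
```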
... For example, Figure 2.6 shows a snapshot from the Wikia infobox of the entity Zeus in Greek mythology and the Wiki markup table extracted from the dump file of the Wiki page. In the case of unstructured texts, the lexical and dependency features are usually used to construct ... [Figure residue from the citing work: a taxonomy of extraction approaches (semi-structured texts vs. sentence-level and document-level extraction from text; pattern-based and deep-learning-based methods), with citations ranging from Hearst 1992 and Soderland et al. 1995 to Suchanek et al. 2007, Auer et al. 2007, Hertling and Paulheim 2020, and Cui et al. 2018.] ... <BOOK> indicates the relation hasAuthor between a book and a person. The drawback of pattern-based methods is that they require intervention from human experts and are hence costly and not scalable. ...
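The snippet above describes turning a Wikia infobox from a dump file into attribute-value pairs. A minimal sketch of that step, assuming the mwparserfromhell library and a hypothetical, simplified Zeus infobox; the DBpedia extraction framework used by DBkWik is considerably more elaborate.

```python
# Minimal sketch: turn an infobox in MediaWiki markup into attribute-value pairs.
# The markup below is a hypothetical, simplified example.
import mwparserfromhell

page_text = """
{{Infobox deity
| name      = Zeus
| consort   = [[Hera]]
| residence = [[Mount Olympus]]
}}
"""

wikicode = mwparserfromhell.parse(page_text)
for template in wikicode.filter_templates():
    if str(template.name).strip().lower().startswith("infobox"):
        for param in template.params:
            key = str(param.name).strip()
            value = param.value.strip_code().strip()
            print(key, "->", value)   # e.g. consort -> Hera
```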
... Wikia is also constructed similarly to Wikipedia, with each universe organized as a Wiki, so that it also contains pages of entities in the universe, infoboxes, and category networks. With tremendous contributions from fans creating the content, Wikia has become a great source for knowledge extraction in fictional domains [Hertling and Paulheim, 2020]. ...
... The DBkWik project [Hertling and Paulheim, 2020] has leveraged structured infoboxes of fan communities at Wikia (now renamed to fandom.com) to construct a large KB of fictional characters and their salient properties. However, this is strictly limited to relations and respective instances that are present in infoboxes. ...
... The task of the track is to match pairs of knowledge graphs, whose schema and instances have to be matched simultaneously. The individual knowledge graphs are created by running the DBpedia extraction framework on eight different Wikis from the Fandom Wiki hosting platform in the course of the DBkWik project [35,34]. They cover different topics (movies, games, comics, and books) and form three knowledge graph clusters sharing the same domain, e.g. ...
... They are measured in terms of macro precision and recall. The results of non-specific systems are not reported here, as we could observe in the last campaigns that they can have intermediate results in tests of type ii) (same ontologies task) and poor performance in tests of type i) (different ontologies task). The detailed results can be found on the MultiFarm track results page. In terms of runtime, the results are not comparable to those from last year, as the systems have been run in a different environment in terms of memory and number of processors. ...
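Macro precision and recall, as used in the evaluation above, average per-test-case scores rather than pooling all correspondences; a small illustrative computation with made-up alignments follows.

```python
# Illustrative computation of macro precision/recall over several test cases.
# Alignments are represented as sets of (source_entity, target_entity) pairs.
def precision_recall(system, reference):
    tp = len(system & reference)
    precision = tp / len(system) if system else 0.0
    recall = tp / len(reference) if reference else 0.0
    return precision, recall

def macro_scores(cases):
    """cases: list of (system_alignment, reference_alignment) per test case."""
    scores = [precision_recall(s, r) for s, r in cases]
    macro_p = sum(p for p, _ in scores) / len(scores)
    macro_r = sum(r for _, r in scores) / len(scores)
    return macro_p, macro_r

# Hypothetical example with two test cases.
cases = [
    ({("a1", "b1"), ("a2", "b2")}, {("a1", "b1"), ("a3", "b3")}),
    ({("x1", "y1")},               {("x1", "y1"), ("x2", "y2")}),
]
print(macro_scores(cases))   # -> (0.75, 0.5)
```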
Conference Paper
The Ontology Alignment Evaluation Initiative (OAEI) aims at comparing ontology matching systems on precisely defined test cases. These test cases can be based on ontologies of different levels of complexity and use different evaluation modalities (e.g., blind evaluation, open evaluation, or consensus). The OAEI 2021 campaign offered 13 tracks and was attended by 21 participants. This paper is an overall presentation of that campaign.
... For the evaluation and analyses of the components of the developed framework, we use the data extracted for the DBkWik KG [15]. This data was made available for the OAEI 2020 Knowledge Graph track, in which the main task was developing systems capable of matching both the instances and the schema of KGs. ...
... The value was parsed by the Is date feature methodology and compared with the ground truth to evaluate the precision and recall. Table 4 compares our approach (dateutil + rules) with the Named-Entity Recognition (NER) modules of SpaCy and Facebook's Duckling. Our approach achieves significantly higher performance when identifying dates in our use case. ...
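The exact rules behind the "dateutil + rules" approach are not given in the snippet; a minimal sketch of the general idea (parse with dateutil, reject implausible values with a hand-written rule) could look as follows, where the concrete rule is an assumption.

```python
# Minimal sketch of an "is this value a date?" check in the spirit of
# dateutil + hand-written rules; the rule below is an assumption,
# not the one used in the cited work.
from dateutil import parser as dateparser

def is_date(value: str) -> bool:
    value = value.strip()
    # Rule (assumed): short purely numeric strings (e.g. "42") are too ambiguous.
    if value.isdigit() and len(value) < 4:
        return False
    try:
        dateparser.parse(value)
        return True
    except (ValueError, OverflowError):
        return False

for v in ["12 May 1987", "2020-11-02", "Mount Olympus", "42"]:
    print(v, "->", is_date(v))
```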
Article
Knowledge Graphs have emerged as a core technology to aggregate and publish knowledge on the Web. However, integrating knowledge from different sources, not specifically designed to be interoperable, is not a trivial task. Finding the right ontologies to model a dataset is a challenge since several valid data models exist and there is no clear agreement between them. In this paper, we propose to facilitate the selection of a data model with the RICDaM (Recommending Interoperable and Consistent Data Models) framework. RICDaM generates and ranks candidates that match entity types and properties in an input dataset. These candidates are obtained by aggregating freely available domain RDF datasets in a knowledge graph and then enriching the relationships between the graph’s entities. The entity type and object property candidates are obtained by exploiting the instances and structure of this knowledge graph to compute a score that considers both the accuracy and interoperability of the candidates. Datatype properties are predicted with a random forest model, trained on the knowledge graph properties and their values, so to make predictions on candidate properties and rank them according to different measures. We present experiments using multiple datasets from the library domain as a use case and show that our methodology can produce meaningful candidate data models, adaptable to specific scenarios and needs.
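The abstract states that datatype properties are predicted with a random forest trained on property values. A minimal, assumption-laden sketch of that idea with scikit-learn; the features, labels, and training data below are invented and not taken from RICDaM.

```python
# Illustrative sketch: classify candidate datatype properties from simple
# value-based features, loosely following the random-forest idea in the abstract.
# Features and training data are fabricated for demonstration only.
from sklearn.ensemble import RandomForestClassifier

def value_features(value: str):
    return [
        len(value),
        sum(ch.isdigit() for ch in value),
        value.count("-"),
        int(value.isdigit()),
    ]

train_values = ["1987-05-12", "2001-01-01", "John Smith", "Jane Doe", "42", "128"]
train_labels = ["date", "date", "name", "name", "number", "number"]

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit([value_features(v) for v in train_values], train_labels)

print(clf.predict([value_features("1999-12-31")]))   # likely ['date']
```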
... Fig. 2 depicts the performance for extracting resources from Wikipedia pages. It can be observed that the system scales linearly with the number of ingoing links. The approach of generating resources for URIs also determines the kind of SPARQL queries that our approach can process. ...
... While this demonstration has been based on DBpedia, it can be transferred to other approaches as well. With the same mechanism, it would be possible to use the extraction code of other Wikipedia-based knowledge graphs, such as YAGO [9] or the Linked Hypernym extraction [5], as well as to transfer the approach to other Wikis [4]. Other refinement operators that are local to a Wikipedia page, such as extracting relations from text [2], would also be applicable. ...
Preprint
Full-text available
Modern large-scale knowledge graphs, such as DBpedia, are datasets which require large computational resources to serve and process. Moreover, they often have longer release cycles, which leads to outdated information in those graphs. In this paper, we present DBpedia on Demand -- a system which serves DBpedia resources on demand without the need to materialize and store the entire graph, and which even provides limited querying functionality.
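The on-demand idea can be sketched conceptually: triples for a resource are produced only when the resource is first requested and are then cached, instead of materializing the whole graph. The fetch and extraction functions below are placeholders, not the actual DBpedia on Demand implementation.

```python
# Conceptual sketch of "extraction on demand": triples for a resource are
# produced only when the resource is first requested, then cached.
from functools import lru_cache

def fetch_wiki_markup(resource: str) -> str:
    # Placeholder: a real system would download the source page for the
    # requested resource (e.g. via the MediaWiki API).
    return "{{Infobox person | name = %s }}" % resource

def extract_triples(resource: str, markup: str):
    # Placeholder for running the extraction framework on a single page.
    return [(resource, "rdfs:label", resource.replace("_", " "))]

@lru_cache(maxsize=10_000)
def resolve(resource: str):
    """Serve triples for one resource on demand, caching the result."""
    return tuple(extract_triples(resource, fetch_wiki_markup(resource)))

print(resolve("Zeus"))
```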
... In the course of time, DIEF was internationalized [5] and extended with support for over 140 languages; the DBpedia Ontology was created as an integration layer over all the Infobox template parameters; Wikidata and Wikimedia Commons were added to the extraction as well as the text of the articles. While there is always a residual of extraction errors, DIEF can be considered the state-of-the-art tool to squeeze the most and the best information from a huge variety of Wikimedia projects (and also other MediaWiki deployments [3]). Recently, DBpedia automated its extraction using the MARVIN bot that produces 22 billion triples per release using an agile workflow [4] and a variety of novel tests to ensure quality. ...
Article
Full-text available
This paper addresses one of the largest and most complex data curation workflows in existence: Wikipedia and Wikidata, with a high number of users and curators adding factual information from external sources via a non-systematic Wiki workflow to Wikipedia's infoboxes and Wikidata items. We present high-level analyses of the current state, the challenges and limitations in this workflow and supplement it with a quantitative and semantic analysis of the resulting data spaces by deploying DBpedia's integration and extraction capabilities. Based on an analysis of millions of references from Wikipedia infoboxes in different languages, we can find the most important sources which can be used to enrich other knowledge bases with information of better quality. An initial tool is presented, the GlobalFactSync browser, as a prototype to discuss further measures to develop a more systematic approach for data curation in the WikiVerse.
Article
Knowledge graphs are widely used in industry and studied within the academic community. However, the models applied in the development of knowledge graphs vary. Analysing and providing a synthesis of the commonly used approaches to knowledge graph development would provide researchers and practitioners a better understanding of the overall process and methods involved. Hence, this paper aims to define the overall process of knowledge graph development and its key constituent steps. For this purpose, a systematic review and a conceptual analysis of the literature was conducted. The resulting process was compared to case studies to evaluate its applicability. The proposed process suggests a unified approach and provides guidance for both researchers and practitioners when constructing and managing knowledge graphs.
Article
Knowledge fusion used for handling cross-domain or complex questions in conversation systems has received considerable attention and interest. However, most existing knowledge fusion methods rely on a centralized server, which faces many limitations and challenges, such as a single point of failure, content tampering, and entrusted contribution assignment. In this chapter, we present a novel blockchain-based conversation system framework based on a decentralized knowledge fusion scheme using blockchain smart contracts to guarantee transparency, traceability, and non-tampering. Furthermore, we implement a system prototype based on our proposed master-chain structure and consensus algorithm in the Fabric network; the evaluation results of three case studies show the feasibility and effectiveness of the proposed decentralized knowledge fusion design in a conversation system.
The 4-volume set LNAI 13013 – 13016 constitutes the proceedings of the 14th International Conference on Intelligent Robotics and Applications, ICIRA 2021, which took place in Yantai, China, during October 22-25, 2021. The 299 papers included in these proceedings were carefully reviewed and selected from 386 submissions. They were organized in topical sections as follows: Robotics dexterous manipulation; sensors, actuators, and controllers for soft and hybrid robots; cable-driven parallel robot; human-centered wearable robotics; hybrid system modeling and human-machine interface; robot manipulation skills learning; micro_nano materials, devices, and systems for biomedical applications; actuating, sensing, control, and instrumentation for ultra-precision engineering; human-robot collaboration; robotic machining; medical robot; machine intelligence for human motion analytics; human-robot interaction for service robots; novel mechanisms, robots and applications; space robot and on-orbit service; neural learning enhanced motion planning and control for human robot interaction; medical engineering.
Chapter
The application of Augmented Reality (AR) to present work instructions through visual elements for assembly operations can effectively reduce the cognitive load of operators. Among the challenges that prevent AR systems from being widely used in complex assembly operations, the lack of adaptive presentation of augmented work instructions (AWI) is an important one. This paper proposes an augmented assembly work instruction knowledge graph (AWI-KG) for adaptive presentation to solve this problem. The characteristics of AWI are analyzed to abstract the concepts and relationships for constructing the domain ontology, and the domain ontology is used to extract the entities in the assembly manual to establish the AWI-KG. Then, a collaborative filtering method based on the AWI-KG is proposed: multidimensional vectors are used to calculate the adaptive presentation mode for AWI according to the operator's capabilities. The proposed method was applied in an AR system, and the results show that the recommended visual elements can effectively adapt to the operator's capabilities.
Conference Paper
Full-text available
The Ontology Alignment Evaluation Initiative (OAEI) aims at comparing ontology matching systems on precisely defined test cases. These test cases can be based on ontologies of different levels of complexity (from simple thesauri to expressive OWL ontologies) and use different evaluation modalities (e.g., blind evaluation, open evaluation, or consensus). The OAEI 2018 campaign offered 12 tracks with 23 test cases, and was attended by 19 participants. This paper is an overall presentation of that campaign.
Article
Full-text available
Large-scale knowledge graphs, such as DBpedia, Wikidata, or YAGO, can be enhanced by relation extraction from text, using the data in the knowledge graph as training data, i.e., using distant supervision. While most existing approaches use language-specific methods (usually for English), we present a language-agnostic approach that exploits background knowledge from the graph instead of language-specific techniques and builds machine learning models only from language-independent features. We demonstrate the extraction of relations from Wikipedia abstracts, using the twelve largest language editions of Wikipedia. From those, we can extract 1.6 M new relations in DBpedia at a level of precision of 95%, using a RandomForest classifier trained only on language-independent features. We furthermore investigate the similarity of models for different languages and show an exemplary geographical breakdown of the information extracted. In a second series of experiments, we show how the approach can be transferred to DBkWik, a knowledge graph extracted from thousands of Wikis. We discuss the challenges and first results of extracting relations from a larger set of Wikis, using a less formalized knowledge graph.
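The distant-supervision step described in the abstract can be sketched as follows: sentences from abstracts that mention two entities already related in the knowledge graph become training examples for that relation. The data and the pairing logic below are invented for illustration; the cited work additionally builds language-independent features and a RandomForest model on top of such examples.

```python
# Illustrative sketch of distant supervision: sentences mentioning two entities
# that are already related in the knowledge graph are collected as positive
# training examples for that relation. All data below is fabricated.
kg_facts = {
    ("Douglas_Adams", "The_Hitchhiker's_Guide_to_the_Galaxy"): "author",
}

abstract_sentences = [
    ("Douglas_Adams", "The_Hitchhiker's_Guide_to_the_Galaxy",
     "Douglas Adams wrote The Hitchhiker's Guide to the Galaxy in 1979."),
    ("Douglas_Adams", "Cambridge",
     "Douglas Adams was born in Cambridge."),
]

training_examples = []
for subj, obj, sentence in abstract_sentences:
    relation = kg_facts.get((subj, obj))
    if relation is not None:
        # Language-independent features (entity types, link structure, ...)
        # would be computed here; the sentence text itself is not required.
        training_examples.append((subj, obj, relation))

print(training_examples)
```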
Conference Paper
Full-text available
Following the Linked Data principles means maximizing the reusability of data over the Web. Reuse of datasets can become apparent when datasets are linked to from other datasets, and referred in scientific articles or community discussions. It can thus be measured, similarly to citations of papers. In this paper we propose dataset reuse metrics and use these metrics to analyze indications of dataset reuse in different communication channels within a scientific community. In particular we consider mailing lists and publications in the Semantic Web community and their correlation with data interlinking. Our results demonstrate that indications of dataset reuse across different communication channels and reuse in terms of data interlinking are positively correlated.
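The kind of correlation analysis described above can be sketched with invented counts; the cited work defines its own reuse metrics, so the variables below are purely illustrative.

```python
# Illustrative sketch: correlate mentions of datasets in a communication channel
# with the number of incoming links to those datasets. Counts are invented.
from scipy.stats import spearmanr

datasets = ["dbpedia", "yago", "wikidata", "geonames"]
mailing_list_mentions = [120, 35, 80, 15]    # hypothetical counts
incoming_dataset_links = [450, 90, 300, 60]  # hypothetical counts

for name, mentions, links in zip(datasets, mailing_list_mentions, incoming_dataset_links):
    print(f"{name}: {mentions} mentions, {links} incoming links")

rho, p_value = spearmanr(mailing_list_mentions, incoming_dataset_links)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```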
Chapter
The Wikipedia category graph serves as the taxonomic backbone for large-scale knowledge graphs like YAGO or Probase, and has been used extensively for tasks like entity disambiguation or semantic similarity estimation. Wikipedia’s categories are a rich source of taxonomic as well as non-taxonomic information. The category German science fiction writers, for example, encodes the type of its resources (Writer), as well as their nationality (German) and genre (Science Fiction). Several approaches in the literature make use of fractions of this encoded information without exploiting its full potential. In this paper, we introduce an approach for the discovery of category axioms that uses information from the category network, category instances, and their lexicalisations. With DBpedia as background knowledge, we discover 703k axioms covering 502k of Wikipedia’s categories and populate the DBpedia knowledge graph with additional 4.4M relation assertions and 3.3M type assertions at more than 87% and 90% precision, respectively.
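A toy heuristic conveys the intuition behind category axioms: split a category label into a head noun (candidate type) and known modifiers (candidate attribute values). The dictionaries and the splitting rule below are assumptions for illustration; the actual approach also exploits category instances, the category network, and DBpedia as background knowledge.

```python
# Heuristic sketch only: decompose a category label into axiom candidates.
KNOWN_NATIONALITIES = {"german", "french", "japanese"}
KNOWN_GENRES = {"science fiction", "fantasy", "horror"}

def category_axiom_candidates(category: str):
    tokens = category.lower()
    # Head noun as candidate type, e.g. "writers" -> Writer.
    axioms = {"type": category.split()[-1].rstrip("s").capitalize()}
    for nationality in KNOWN_NATIONALITIES:
        if tokens.startswith(nationality):
            axioms["nationality"] = nationality.capitalize()
    for genre in KNOWN_GENRES:
        if genre in tokens:
            axioms["genre"] = genre.title()
    return axioms

print(category_axiom_candidates("German science fiction writers"))
# -> {'type': 'Writer', 'nationality': 'German', 'genre': 'Science Fiction'}
```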
Conference Paper
DBpedia releases consist of more than 70 multilingual datasets that cover data extracted from different language-specific Wikipedia instances. The data extracted from those Wikipedia instances are transformed into RDF using mappings created by the DBpedia community. Nevertheless, not all the mappings are correct and consistent across all the distinct language-specific DBpedia datasets. As these incorrect mappings are spread in a large number of mappings, it is not feasible to inspect all such mappings manually to ensure their correctness. Thus, the goal of this work is to propose a data-driven method to detect incorrect mappings automatically by analyzing the information from both instance data as well as ontological axioms. We propose a machine learning based approach to building a predictive model which can detect incorrect mappings. We have evaluated different supervised classification algorithms for this task and our best model achieves 93% accuracy. These results help us to detect incorrect mappings and achieve a high-quality DBpedia.
Conference Paper
Hypernymy relations are an important asset in many applications, and a central ingredient to Semantic Web ontologies. The IsA database is a large collection of such hypernymy relations extracted from the Common Crawl. In this paper, we introduce WebIsALOD, a Linked Open Data release of the IsA database, containing 400M hypernymy relations, each provided with rich provenance information. As the original dataset contained more than 80% wrong, noisy extractions, we run a machine learning algorithm to assign confidence scores to the individual statements. Furthermore, 2.5M links to DBpedia and 23.7k links to the YAGO class hierarchy were created at a precision of 97%. In total, the dataset contains 5.4B triples.
Conference Paper
Public Knowledge Graphs (KGs) on the Web are considered a valuable asset for developing intelligent applications. They contain general knowledge which can be used, e.g., for improving data analytics tools, text processing pipelines, or recommender systems. While the large players, e.g., DBpedia, YAGO, or Wikidata, are often considered similar in nature and coverage, there are, in fact, quite a few differences. In this paper, we quantify those differences, and identify the overlapping and the complementary parts of public KGs. From those considerations, we can conclude that the KGs are hardly interchangeable, and that each of them has its strengths and weaknesses when it comes to applications in different domains.
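Quantifying overlap and complement, as described in the abstract, can be reduced to set operations over linked entities; the entity sets below are invented for illustration and stand in for whatever identity links (e.g. shared Wikipedia page names) are available.

```python
# Illustrative sketch: quantify overlap and complement of two knowledge graphs
# by comparing their sets of linked entities. The sets are invented.
kg_a = {"Zeus", "Hera", "Athens", "Mount_Olympus"}
kg_b = {"Zeus", "Athens", "Berlin", "DBpedia"}

overlap = kg_a & kg_b
only_a = kg_a - kg_b
only_b = kg_b - kg_a

print(f"overlap: {len(overlap)} entities, "
      f"only in A: {len(only_a)}, only in B: {len(only_b)}")
```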