Figure 2
Source publication
The integration of diverse structured and unstructured information sources into a unified, domain-specific knowledge base is an important task in many areas. A well-maintained knowledge base enables data analysis in complex scenarios, such as risk analysis in the financial sector or the investigation of large data leaks like the Paradise or Panama Papers ...
Contexts in source publication
Context 1
... Entity Landscape Explorer (ELEX) shown in Figure 2 is designed to meet the needs of the end user. Since a knowledge graph can easily contain thousands of nodes and edges, a user must be able to efficiently examine the graph and the associated knowledge base. ...
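As a rough illustration of the kind of focused exploration such an interface needs to support, the following sketch (hypothetical graph, entity names, and radius; not ELEX code) uses networkx to extract only the local neighborhood of a selected entity, so that a user-facing view never has to render the full graph:

# Minimal sketch of neighborhood-based graph exploration (hypothetical data,
# not ELEX code): instead of rendering the whole knowledge graph, extract only
# the entities within a fixed number of hops around a selected entity.
import networkx as nx

# Toy knowledge graph; in practice this would hold thousands of nodes and edges.
kg = nx.MultiDiGraph()
kg.add_edge("Acme Corp", "Jane Doe", relation="has_ceo")
kg.add_edge("Acme Corp", "Beta Ltd", relation="owns")
kg.add_edge("Beta Ltd", "Cayman Holdings", relation="registered_at")
kg.add_edge("Gamma Inc", "Jane Doe", relation="has_board_member")

def local_view(graph, entity, hops=2):
    """Return the subgraph within `hops` edges of `entity`, ignoring edge direction."""
    return nx.ego_graph(graph, entity, radius=hops, undirected=True)

view = local_view(kg, "Acme Corp", hops=1)
print(sorted(view.nodes()))    # ['Acme Corp', 'Beta Ltd', 'Jane Doe']
print(view.number_of_edges())  # only the edges among those nodes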
Citations
... Domain-specific NER typically needs to introduce domain-specific (sub-)categories of the established named entity (NE) categories or entirely new categories. This is because domain-specific texts contain NE categories that are (1) detailed variants of the standard NE categories, e.g., "Person" is replaced with the domain-specific sub-categories "Players" and "Coaches" [27], (2) standard NE categories extended with a small number of new categories, e.g., "Trigger of a traffic jam" [11,19,22], and (3) domain-derived NE categories, e.g., "Proteins" in biology or "Reactions" in chemistry domains [9,18,25,30]. Most domain-derived NE categories originate from structured classifications or dictionaries [9,12,25] or are derived by manually unifying multiple of them [5]. ...
Named entity recognition (NER) is an important task that aims to resolve universal categories of named entities, e.g., persons, locations, organizations, and times. Despite its widespread use in many applications, NER is barely applicable in domains where the general categories are suboptimal, such as engineering or medicine. To facilitate NER of domain-specific types, we propose ANEA, an automated (named) entity annotator to assist human annotators in creating domain-specific NER corpora for German text collections when given a set of domain-specific texts. In our evaluation, we find that ANEA automatically identifies terms that best represent the texts' content, identifies groups of coherent terms, and extracts and assigns descriptive labels to these groups, i.e., it annotates text datasets with domain (named) entities.
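To make the three stages named in the abstract more concrete, here is a deliberately simplified sketch, not ANEA's actual algorithm: the texts, thresholds, and heuristics are invented, document frequency stands in for term weighting, co-occurrence merging stands in for coherence grouping, and the most frequent member term stands in for the descriptive label.

# Simplified stand-in for a term-extraction / grouping / labeling pipeline
# (hypothetical data and heuristics; NOT the ANEA algorithm).
from collections import Counter

docs = [
    "gearbox torque torque bolt",        # invented domain texts
    "gearbox torque shaft",
    "invoice payment supplier payment",
    "supplier payment contract",
]
tokenized = [d.split() for d in docs]

# 1) Identify candidate terms by document frequency.
doc_freq = Counter(t for doc in tokenized for t in set(doc))
terms = {t for t, f in doc_freq.items() if f >= 2}

# 2) Group terms that co-occur in the same document (single-link style merging).
groups = []
for doc in tokenized:
    present = terms & set(doc)
    for g in [g for g in groups if g & present]:
        groups.remove(g)
        present |= g
    if present:
        groups.append(present)

# 3) Label each group with its globally most frequent member term.
term_freq = Counter(t for doc in tokenized for t in doc if t in terms)
for g in groups:
    print(max(g, key=term_freq.__getitem__), sorted(g))
    # e.g. torque ['gearbox', 'torque'] and payment ['payment', 'supplier']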
... This demands a long-term approach of KG curation in order to keep the KG material clean and valid, as well as to ensure quality. The KG curation is a "never-ending" process that includes two tasks [63,79,80]: i) Removing inconsistencies. The KG content should be continuously updated in order to maintain consistency and avoid contradictory statements. ...
Entity-centric knowledge graphs (KGs) are becoming increasingly popular for gathering information about entities. The schemas of KGs are semantically rich, with many different types and predicates to define the entities and their relationships. These KGs contain knowledge that requires understanding of the KG’s structure and patterns to be exploited. Their rich data structure can express entities with semantic types and relationships, oftentimes domain-specific, that must be made explicit and understood to get the most out of the data. Although different applications can benefit from such rich structure, this comes at a price. A significant challenge with KGs is the quality of their data. Without high-quality data, the applications cannot use the KG. However, as a result of the automatic creation and update of KGs, there are a lot of noisy and inconsistent data in them and, because of the large number of triples in a KG, manual validation is impossible. In this thesis, we present different tools that can be utilized in the process of continuous creation and curation of KGs. We first present an approach designed to create a KG in the accounting field by matching entities. We then introduce methods for the continuous curation of KGs. We present an algorithm for conditional rule mining and apply it on large graphs. Next, we describe RuleHub, an extensible corpus of rules for public KGs which provides functionalities for the archival and the retrieval of rules. We also report methods for using logical rules in two different applications: teaching soft rules to pre-trained language models (RuleBert) and explainable fact checking (ExpClaim).
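As a small illustration of how the quality of a logical rule over a KG can be quantified, the following sketch computes the standard support and confidence measures for one hypothetical rule; it is not the conditional rule-mining algorithm from the thesis, and the predicates and triples are invented.

# Support/confidence of a hypothetical rule over toy triples
# (illustration only, not the thesis' rule-mining algorithm).
triples = {
    ("alice", "born_in", "paris"), ("paris", "capital_of", "france"),
    ("alice", "citizen_of", "france"),
    ("bob", "born_in", "berlin"), ("berlin", "capital_of", "germany"),
    ("bob", "citizen_of", "austria"),   # counterexample to the rule
}

def rule_quality(triples):
    """Quality of: born_in(x,y) AND capital_of(y,z) => citizen_of(x,z)."""
    born = {(s, o) for s, p, o in triples if p == "born_in"}
    capital = {(s, o) for s, p, o in triples if p == "capital_of"}
    citizen = {(s, o) for s, p, o in triples if p == "citizen_of"}
    # Pairs (x, z) predicted by the rule body.
    body = {(x, z) for x, y in born for y2, z in capital if y == y2}
    support = len(body & citizen)
    confidence = support / len(body) if body else 0.0
    return support, confidence

print(rule_quality(triples))  # (1, 0.5): one correct prediction out of two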
... In recent years, systems such as DeepDive [41], Knowledge Vault [11], and CurEx [30] have made it increasingly easy to automatically construct vast knowledge graphs (KGs) from structured and unstructured data. These systems often consist of many different components designed to extract and integrate facts from multiple sources. ...
... Chapter 5 introduces the prototypical CurEx system shown in Figure 1.2. It is based on the publication of Loster et al. [2018b] and enables the construction of domain-specific knowledge bases from structured and unstructured data sources. In addition to providing different user interfaces, CurEx enables the selective generation of multiple knowledge graphs and, thanks to its modular architecture, supports the step-by-step improvement of individual system components. ...
... In recent years, systems such as DeepDive [Shin et al., 2015], Knowledge Vault [Dong et al., 2014], and CurEx [Loster et al., 2018b] have made it increasingly easy to automatically construct vast knowledge graphs (KGs). These systems often consist of many different components designed to extract and integrate facts from numerous different sources. ...
... The content of this chapter is based on [Loster et al., 2018b] and is structured as follows: In Section 5.1, we give an overview of the CurEx system architecture and discuss the concrete implementations of both the structured and unstructured integration components. We introduce the different user interfaces and how they can be used to interact with the system in Section 5.2. ...
Modern knowledge bases contain and organize knowledge from many different topic areas. Apart from specific entity information, they also store information about the relationships among these entities. Combining this information results in a knowledge graph that can be particularly helpful in cases where relationships are of central importance. Among other applications, modern risk assessment in the financial sector can benefit from the inherent network structure of such knowledge graphs to assess the consequences and risks of certain events, such as corporate insolvencies or fraudulent behavior. As public knowledge bases often do not contain the necessary information for the analysis of such scenarios, the need arises to create and maintain dedicated domain-specific knowledge bases. This thesis investigates the process of creating domain-specific knowledge bases from structured and unstructured data sources. In particular, it addresses the topics of named entity recognition (NER), duplicate detection, and knowledge validation, which represent essential steps in the construction of knowledge bases. To this end, we present a novel method for duplicate detection based on a Siamese neural network that is able to learn a dataset-specific similarity measure, which is used to identify duplicates. Using this specialized network architecture, we design and implement a knowledge transfer between two deduplication networks, which leads to significant performance improvements and a reduction in the required training data. Furthermore, we propose a named entity recognition approach that is able to identify company names by integrating external knowledge in the form of dictionaries into the training process of a conditional random field classifier. In this context, we study the effects of different dictionaries on the performance of the NER classifier. We show that both the inclusion of domain knowledge and the generation and use of alias names result in significant performance improvements. For the validation of knowledge represented in a knowledge base, we introduce Colt, a framework for knowledge validation based on the interactive quality assessment of logical rules. In its most expressive implementation, we combine Gaussian processes with neural networks to create Colt-GP, an interactive algorithm for learning rule models. Unlike other approaches, Colt-GP uses knowledge graph embeddings and user feedback to cope with data quality issues of knowledge bases. The learned rule model can be used to conditionally apply a rule and assess its quality. Finally, we present CurEx, a prototypical system for building domain-specific knowledge bases from structured and unstructured data sources. Its modular design is based on scalable technologies, which, in addition to processing large datasets, ensures that the modules can be easily exchanged or extended. CurEx offers multiple user interfaces, each tailored to the needs of a specific user group, and is fully compatible with the Colt framework, which can be used as part of the system. We conduct a wide range of experiments with different datasets to determine the strengths and weaknesses of the proposed methods. To ensure the validity of our results, we compare the proposed methods with competing approaches.
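The duplicate-detection idea of learning a similarity measure with a Siamese network can be illustrated with a minimal PyTorch sketch; this is a generic toy setup, not the architecture from the thesis, and the record pairs, feature hashing, and layer sizes are invented assumptions.

# Toy Siamese network for duplicate detection (illustration only, not the
# thesis' architecture): records are encoded as hashed character-trigram
# vectors, both inputs pass through the same encoder (shared weights), and a
# classifier on the element-wise difference predicts whether the pair refers
# to the same entity.
import torch
import torch.nn as nn

DIM = 64  # size of the hashed trigram feature space (arbitrary choice)

def featurize(text: str) -> torch.Tensor:
    vec = torch.zeros(DIM)
    padded = f"  {text.lower()}  "
    for i in range(len(padded) - 2):
        vec[hash(padded[i:i + 3]) % DIM] += 1.0
    return vec

class SiameseDeduper(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(DIM, 32), nn.ReLU())  # shared tower
        self.classifier = nn.Linear(32, 1)

    def forward(self, a, b):
        ea, eb = self.encoder(a), self.encoder(b)            # same weights for both inputs
        return self.classifier(torch.abs(ea - eb)).squeeze(-1)

pairs = [("Acme Corp.", "ACME Corporation", 1.0),            # invented training pairs
         ("Acme Corp.", "Beta Ltd", 0.0)]
xa = torch.stack([featurize(a) for a, _, _ in pairs])
xb = torch.stack([featurize(b) for _, b, _ in pairs])
y = torch.tensor([lbl for _, _, lbl in pairs])

model = SiameseDeduper()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.BCEWithLogitsLoss()
for _ in range(100):                      # tiny training loop on the toy pairs
    opt.zero_grad()
    loss = loss_fn(model(xa, xb), y)
    loss.backward()
    opt.step()
print(torch.sigmoid(model(xa, xb)))       # estimated duplicate probability per pair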
... However, the sheer size of these datasets requires support from automated mechanisms for an otherwise unattainable task. With CurEx, Loster et al. [37] demonstrate the entire pipeline of curating company networks extracted from text. They discuss the challenges of this system in the context of its application in a large financial institution [38]. ...
In our modern society, almost all events, processes, and decisions in a corporation are documented by internal written communication, legal filings, or business and financial news. The valuable knowledge in such collections is not directly accessible to computers, as they mostly consist of unstructured text. This chapter provides an overview of corpora commonly used in research and highlights related work and state-of-the-art approaches to extract and represent financial entities and relations. The second part of this chapter considers applications based on knowledge graphs of automatically extracted facts. Traditional information retrieval systems typically require the user to have prior knowledge of the data. Suitable visualization techniques can overcome this requirement and enable users to explore large sets of documents. Furthermore, data mining techniques can be used to enrich or filter knowledge graphs. This information can augment source documents and guide exploration processes. Systems for document exploration are tailored to specific tasks, such as investigative work in audits or legal discovery, monitoring compliance, or providing information in a retrieval system to support decisions.
A knowledge graph employs a graph structure to store knowledge in the form of entities, relations, attributes, etc.; it can effectively represent correlations among data and has been applied in many fields, including search engine optimization, intelligent question answering, and recommendation systems. In this paper, we focus on the research and application of domain-specific knowledge graphs in the field of the smart city, which has not yet received sufficient attention. By constructing a corresponding knowledge graph, data on urban traffic, services, and public resources are integrated to help city builders and managers make important decisions. Currently, the major challenges for the smart city lie in data mining and the proper application of data. On the one hand, data is usually stored by government management departments, which creates challenges such as high data storage overhead and inefficient data usage. On the other hand, data cannot be coordinated and shared between different city management systems, and data silos exist. Therefore, we review the related literature on knowledge graphs in the smart city domain. Specifically, we analyze and summarize knowledge graph construction research in the field of smart cities from four perspectives, i.e., smart city ontology, urban data processing, urban knowledge graph construction, and their applications. Finally, the limitations and prospects of research on urban knowledge graphs are provided.
Background
Cyber threats are increasing across all business sectors, with health care being a prominent domain. In response to the ever-increasing threats, health care organizations (HOs) are enhancing their technical measures with cybersecurity controls and other advanced solutions for further protection. Despite the need for technical controls, humans are evidently the weakest link in the cybersecurity posture of HOs. This suggests that addressing the human aspects of cybersecurity is a key step toward managing cyber-physical risks. In practice, HOs are required to apply general cybersecurity and data privacy guidelines that focus on human factors. However, there is limited literature on the methodologies and procedures that can assist in successfully mapping these guidelines to specific controls (interventions), including awareness activities and training programs, with a measurable impact on personnel. To this end, tools and structured methodologies for assisting higher management in selecting the minimum number of required controls that will be most effective for the health care workforce are highly desirable.
Objective
This study aimed to introduce a cyber hygiene (CH) methodology that uses a unique survey-based risk assessment approach for raising the cybersecurity and data privacy awareness of different employee groups in HOs. The main objective was to identify the most effective strategy for managing cybersecurity and data privacy risks and recommend targeted human-centric controls that are tailored to organization-specific needs.
Methods
The CH methodology relied on a cross-sectional, exploratory survey study followed by a proposed risk-based survey data analysis approach. First, survey data were collected from 4 different employee groups across 3 European HOs, covering 7 categories of cybersecurity and data privacy risks. Next, survey data were transcribed and fitted into a proposed risk-based approach matrix that translated risk levels to strategies for managing the risks.
Results
A list of human-centric controls and implementation levels was created. These controls were associated with risk categories and mapped to risk strategies for managing the risks related to all employee groups. This mapping enabled the computation and subsequent recommendation of subsets of human-centric controls that implement the identified strategy for managing the overall risk of the HOs (a brief illustrative sketch of such a control-selection step follows the Conclusions below). An indicative example demonstrated the application of the CH methodology in a simple scenario. Finally, by applying the CH methodology in the health care sector, we obtained results in the form of risk markings; identified strategies to manage the risks; and recommended controls for each of the 3 HOs, each employee group, and each risk category.
Conclusions
The proposed CH methodology improves the CH perception and behavior of personnel in the health care sector and provides risk strategies together with a list of recommended human-centric controls for managing a wide range of cybersecurity and data privacy risks related to health care employees.
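As a toy illustration of how a small set of controls covering the identified high-risk categories might be computed (referenced in the Results above), the following sketch applies a greedy set-cover heuristic; the control names, risk categories, and the heuristic itself are illustrative assumptions, not the paper's actual procedure.

# Greedy selection of controls covering a set of high-risk categories
# (invented controls and categories; illustration only, not the CH methodology).
controls = {
    "phishing awareness training": {"email security", "social engineering"},
    "password manager rollout":    {"credential hygiene"},
    "clean desk policy briefing":  {"physical security", "data privacy"},
    "incident reporting drill":    {"social engineering", "data privacy"},
}

def recommend(controls, high_risk_categories):
    """Greedy set cover: repeatedly pick the control covering most open risks."""
    uncovered, chosen = set(high_risk_categories), []
    while uncovered:
        best = max(controls, key=lambda c: len(controls[c] & uncovered))
        if not controls[best] & uncovered:
            break                      # remaining risks cannot be covered
        chosen.append(best)
        uncovered -= controls[best]
    return chosen

print(recommend(controls, {"email security", "data privacy", "credential hygiene"}))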
This open access book covers the use of data science, including advanced machine learning, big data analytics, Semantic Web technologies, natural language processing, social media analysis, time series analysis, among others, for applications in economics and finance. In addition, it shows some successful applications of advanced data science solutions used to extract new knowledge from data in order to improve economic forecasting models.
The book starts with an introduction to the use of data science technologies in economics and finance and is followed by thirteen chapters showing success stories of the application of specific data science methodologies, touching on particular topics related to novel big data sources and technologies for economic analysis (e.g. social media and news); big data models leveraging supervised/unsupervised (deep) machine learning; natural language processing to build economic and financial indicators; and forecasting and nowcasting of economic variables through time series analysis.
This book is relevant to all stakeholders involved in digital and data-intensive research in economics and finance, helping them to understand the main opportunities and challenges, become familiar with the latest methodological findings, and learn how to use and evaluate the performances of novel tools and frameworks. It primarily targets data scientists and business analysts exploiting data science technologies, and it will also be a useful resource to research students in disciplines and courses related to these topics. Overall, readers will learn modern and effective data science solutions to create tangible innovations for economic and financial applications.
Knowledge graphs can support many biomedical applications. These graphs represent biomedical concepts and relationships in the form of nodes and edges. In this review, we discuss how these graphs are constructed and applied with a particular focus on how machine learning approaches are changing these processes. Biomedical knowledge graphs have often been constructed by integrating databases that were populated by experts via manual curation, but we are now seeing a more robust use of automated systems. A number of techniques are used to represent knowledge graphs, but often machine learning methods are used to construct a low-dimensional representation that can support many different applications. This representation is designed to preserve a knowledge graph’s local and/or global structure. Additional machine learning methods can be applied to this representation to make predictions within genomic, pharmaceutical, and clinical domains. We frame our discussion first around knowledge graph construction and then around unifying representational learning techniques and unifying applications. Advances in machine learning for biomedicine are creating new opportunities across many domains, and we note potential avenues for future work with knowledge graphs that appear particularly promising.
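A minimal sketch of the "low-dimensional representation" idea, assuming a TransE-style model in which a relation acts as a translation between entity vectors (h + r ≈ t); the triples, dimensionality, and training loop below are toy assumptions, not a real biomedical embedding pipeline.

# Toy TransE-style knowledge graph embedding (hypothetical triples, no negative
# sampling; illustration of low-dimensional representation learning only).
import numpy as np

rng = np.random.default_rng(0)
triples = [("aspirin", "treats", "headache"),
           ("ibuprofen", "treats", "headache"),
           ("aspirin", "interacts_with", "warfarin")]
entities = sorted({x for h, _, t in triples for x in (h, t)})
relations = sorted({r for _, r, _ in triples})

DIM, LR = 8, 0.05
E = {e: rng.normal(size=DIM) for e in entities}   # entity embeddings
R = {r: rng.normal(size=DIM) for r in relations}  # relation embeddings

for _ in range(200):                 # plain gradient steps on ||h + r - t||^2
    for h, r, t in triples:
        grad = 2 * (E[h] + R[r] - E[t])
        E[h] -= LR * grad
        R[r] -= LR * grad
        E[t] += LR * grad

# After training, the residual for an observed triple should be small.
print(np.linalg.norm(E["aspirin"] + R["treats"] - E["headache"]))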