The Challenges of Creating, Maintaining and Exploring Graphs
of Financial Entities
Michael Loster, Tim Repke,
Ralf Krestel, Felix Naumann
Hasso Plattner Institute
Potsdam, Germany
firstname.lastname@hpi.de
Jan Ehmueller,
Benjamin Feldmann
Hasso Plattner Institute
Potsdam, Germany
firstname.lastname@student.hpi.de
Oliver Maspfuhl
Commerzbank
Frankfurt am Main, Germany
oliver.maspfuhl@commerzbank.com
1 OVERVIEW & MOTIVATION
The integration of a wide range of structured and unstructured information sources into a uniform knowledge base is an important task in the financial sector. Modern risk analysis methods, for example, can benefit greatly from such an integrated knowledge base, in particular from a dedicated, domain-specific knowledge graph built on top of it. Knowledge graphs can be used to gain a holistic view of the current economic situation, so that systemic risks can be identified early enough to react appropriately. The graph structure thus allows the investigation of many financial scenarios, such as the impact of a corporate bankruptcy on other market participants within the network. In this particular scenario, the links between the individual market participants can be used to determine which companies are affected by a bankruptcy and to what extent.
We took these considerations as a motivation to start the development of a system capable of constructing and maintaining a knowledge graph of financial entities and their relationships. The envisioned system generates this graph by extracting and combining information from structured data sources, such as Wikidata and DBpedia, as well as from unstructured data sources, such as newspaper articles and financial filings. In addition, the system should incorporate proprietary data sources, such as financial transactions (structured) and credit reports (unstructured). The ultimate goal is a system that recognizes financial entities in structured and unstructured sources, links them with the information of a knowledge base, and then extracts the relations expressed in the text between the identified entities. The constructed knowledge base can in turn be used to build the desired knowledge graph. Our system design consists of several components, each of which addresses a specific subproblem. Figure 1 gives a general overview of the system and its subcomponents.
2 INGESTING STRUCTURED DATA TO THE
KNOWLEDGE BASE
In order to build up a knowledge base, we start by integrating heterogeneous structured data sources, such as Wikidata or DBpedia.
The main challenge is to resolve entities, i.e., to recognize different representations of the same company, person, product, etc. As a starting point for our knowledge base construction, we use a manually curated dataset of our industrial partner and merge it with all other structured data sources. During this merge process, found matches are used to enrich the existing entities in the knowledge base with new information, while previously unknown entities are added to the knowledge base. Currently, we use a combination of traditional similarity metrics, such as Monge-Elkan [3] and Jaro-Winkler [4], for the deduplication process, and we are developing a novel method based on neural networks that can easily be swapped in thanks to the modular system architecture.
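
To make this matching step concrete, the following minimal sketch compares two company names with Jaro-Winkler and a token-level Monge-Elkan variant. It assumes the third-party jellyfish package (version 0.8 or later) for the Jaro-Winkler primitive; the threshold, the combination rule, and the example names are illustrative and not our production configuration.

```python
# Sketch of a pairwise company-name matcher in the spirit of Section 2.
# Assumes the `jellyfish` package; names and thresholds are illustrative.
import jellyfish


def monge_elkan(a: str, b: str) -> float:
    """Monge-Elkan similarity: average best Jaro-Winkler match per token of a."""
    tokens_a, tokens_b = a.lower().split(), b.lower().split()
    if not tokens_a or not tokens_b:
        return 0.0
    return sum(
        max(jellyfish.jaro_winkler_similarity(ta, tb) for tb in tokens_b)
        for ta in tokens_a
    ) / len(tokens_a)


def is_duplicate(name_a: str, name_b: str, threshold: float = 0.9) -> bool:
    """Flag two entity representations as candidates for merging."""
    jw = jellyfish.jaro_winkler_similarity(name_a.lower(), name_b.lower())
    me = monge_elkan(name_a, name_b)
    return max(jw, me) >= threshold


print(monge_elkan("Commerzbank AG", "Commerzbank Aktiengesellschaft"),
      is_duplicate("Commerzbank AG", "Commerzbank Aktiengesellschaft"))
```

In the actual pipeline, such pairwise scores are only computed for blocked candidate pairs, and positive matches trigger the enrichment described above.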
3 KNOWLEDGE BASE ENRICHMENT USING
UNSTRUCTURED DATA
The knowledge base generated from structured sources is further enriched by information extracted from unstructured data sources, such as newspaper articles and bank-internal documents. To extract the desired information, we apply natural language processing techniques to each unstructured data source. The text mining component used for this purpose consists of three subcomponents: named entity recognition, entity linking, and relation extraction.
Named Entity Recognition. The named entity recognition (NER) module is responsible for discovering and extracting mentions of financial entities from texts. The extraction technique used for this purpose was developed to recognize company names in texts and uses a conditional random fields (CRF) classifier to do so [2]. A key advantage of this method is the possibility to integrate external knowledge, in the form of manually created dictionaries, into the training process of the classifier. In this way, the classifier is able to recognize company mentions with a precision of 91.11% and a recall of 78.82%.
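
The dictionary integration can be illustrated with a small feature extractor for a CRF tagger. The sketch below uses sklearn-crfsuite, a toy gazetteer, and a single toy training sentence; the concrete feature set, training corpus, and hyperparameters of the actual component [2] differ.

```python
# Illustrative dictionary-augmented CRF tagger; all names and data are toy examples.
import sklearn_crfsuite

COMPANY_DICT = {"commerzbank", "siemens", "volkswagen"}  # hypothetical gazetteer


def token_features(sentence, i):
    word = sentence[i]
    return {
        "lower": word.lower(),
        "is_upper": word.isupper(),
        "is_title": word.istitle(),
        "suffix3": word[-3:],
        "in_company_dict": word.lower() in COMPANY_DICT,  # external knowledge
    }


def sentence_features(sentence):
    return [token_features(sentence, i) for i in range(len(sentence))]


# Tiny toy corpus with BIO labels; a real setup trains on annotated news text.
train_sents = [["Commerzbank", "raises", "its", "forecast", "."]]
train_labels = [["B-ORG", "O", "O", "O", "O"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
crf.fit([sentence_features(s) for s in train_sents], train_labels)
print(crf.predict([sentence_features(["Siemens", "expands", "."])]))
```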
Entity Linking. Subsequently, the extracted company names are linked to the knowledge base previously constructed from the structured data sources. This task is handled by the entity linking (EL) component. The currently employed EL module uses a simple fuzzy matching approach to link the extracted company names with their corresponding knowledge base entries. However, due to the modular system design, this preliminary linking strategy can easily be replaced by more sophisticated entity linking approaches, such as CohEEL [1].
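
As an illustration of the current fuzzy-matching strategy, the following minimal sketch links an extracted mention to the closest knowledge-base entry using standard-library string similarity; the knowledge-base snippet, its identifiers, and the cutoff are made up for the example.

```python
# Minimal fuzzy-matching entity linking sketch: each extracted company mention
# is linked to the closest knowledge-base entry, or left unlinked below a cutoff.
from difflib import get_close_matches

knowledge_base = {
    "Commerzbank AG": "FIN-0001",     # hypothetical knowledge-base identifiers
    "Deutsche Bank AG": "FIN-0002",
    "Volkswagen AG": "FIN-0003",
}


def link_mention(mention: str, cutoff: float = 0.75):
    names = list(knowledge_base)
    match = get_close_matches(mention, names, n=1, cutoff=cutoff)
    return (match[0], knowledge_base[match[0]]) if match else None


print(link_mention("Commerzbank"))  # -> ('Commerzbank AG', 'FIN-0001')
```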
[Figure 1: Overview of the system architecture. Structured data sources are deduplicated into the knowledge base; unstructured data sources pass through the text mining pipeline (NER, EL, RELEX), whose annotated and linked texts and extracted relations also feed the knowledge base. An administrator controls, monitors, and adjusts the process via the curation interface, while end users explore, search, and filter the resulting graph in the Data Explorer.]
Relation Extraction. The task of the relation extraction (RELEX) component is to detect relationships between the found financial entities. The RELEX component currently in use extracts co-occurrence relationships between individual financial entities found within the same sentence. Initial analysis has shown that co-occurring financial entities often imply a business relationship. As with the other modules, it is possible to replace the currently used RELEX module with more advanced extraction techniques, such as the technique presented by Zuo et al. [6], or even with neural-network-based techniques [5]. In particular, the approach proposed by Zuo et al. is promising, as it is able to extract directional relationships whose arguments are of the same type, in our case company-to-company relationships. The resulting knowledge base can then be used to create a knowledge graph, which in turn can be used to analyze complex issues such as the spread of risk factors.
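
A minimal sketch of such a sentence-level co-occurrence extraction is given below; the sentences and entity annotations are illustrative stand-ins for the output of the NER and EL steps.

```python
# Sentence-level co-occurrence extraction: every pair of financial entities
# linked within the same sentence is recorded as a candidate relationship.
from collections import Counter
from itertools import combinations

# (sentence, entities linked in that sentence) as produced by the upstream steps
linked_sentences = [
    ("Commerzbank AG announced a partnership with Volkswagen AG.",
     ["Commerzbank AG", "Volkswagen AG"]),
    ("Volkswagen AG and Deutsche Bank AG settled the dispute.",
     ["Volkswagen AG", "Deutsche Bank AG"]),
]

cooccurrence = Counter()
for _, entities in linked_sentences:
    for a, b in combinations(sorted(set(entities)), 2):
        cooccurrence[(a, b)] += 1  # undirected co-occurrence edge

for (a, b), count in cooccurrence.items():
    print(f"{a} -- {b}: {count}")
```

The aggregated pair counts then become candidate edges of the knowledge graph, which more advanced extractors such as [5, 6] can refine into typed, directed relationships.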
4 USER INTERFACES, INTERACTIONS AND
SYSTEM SPECIFICATIONS
The entire system runs on a distributed platform to manage the processing of ever-growing data volumes. A key aspect of the system is its modular structure, so that each of the subcomponents can easily be interchanged to benefit from future advancements in the respective research fields. A novel aspect of the system is the separation of the end-user interface from a curation interface that is specially designed to monitor the overall process and to take corrective actions at different levels. On a fine-grained level, an administrator can make minor changes, e.g., correcting a company name or an address, while it is also possible to perform larger operations, such as reverting an entire deduplication run so that the knowledge base is returned to its original state. The curation interface is essentially intended as a tool for monitoring and controlling the knowledge base construction process and thus does not address the needs of the end user. It is primarily designed to inspect the results of each individual processing step (deduplication, entity linking, etc.) and to correct mismatched entities using a specially designed view.
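
One possible way to support such a rollback is to log every merge of a deduplication run together with snapshots of the affected entities, as in the following sketch; the class, identifiers, and in-memory storage are assumptions for illustration, not the actual knowledge-base storage layer.

```python
# Toy in-memory knowledge base with revertible deduplication merges.
import copy


class KnowledgeBase:
    def __init__(self):
        self.entities = {}    # entity_id -> attribute dict
        self.merge_log = []   # (run_id, source_id, target_id, (source_snap, target_snap))

    def merge(self, run_id, source_id, target_id):
        """Merge source into target, keeping snapshots so the merge can be undone."""
        snapshots = (copy.deepcopy(self.entities[source_id]),
                     copy.deepcopy(self.entities[target_id]))
        for key, value in self.entities[source_id].items():
            self.entities[target_id].setdefault(key, value)
        del self.entities[source_id]
        self.merge_log.append((run_id, source_id, target_id, snapshots))

    def revert_run(self, run_id):
        """Undo all merges of one deduplication run, restoring the previous state."""
        for rid, source_id, target_id, (src_snap, tgt_snap) in reversed(self.merge_log):
            if rid == run_id:
                self.entities[source_id] = src_snap
                self.entities[target_id] = tgt_snap
        self.merge_log = [entry for entry in self.merge_log if entry[0] != run_id]


kb = KnowledgeBase()
kb.entities = {"c1": {"name": "Commerzbank AG"},
               "c2": {"name": "Commerzbank", "city": "Frankfurt"}}
kb.merge(run_id="run-42", source_id="c2", target_id="c1")
kb.revert_run("run-42")
print(kb.entities)  # both original entities restored
```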
The Data Explorer focuses on the needs of the end user (e.g., a
credit risk ocer) and is thus designed for exploring, inspecting
and searching the generated knowledge graph. Addressing these
needs, it allows the user to search for individual nodes of interest,
to further explore the graph, and to display the associated node
information from the knowledge base. Furthermore, the user is able
to lter the displayed knowledge graph by node and edge types,
such as displaying only nodes of type “company” or only edges
of type “is partner of”. Another important feature is the ability to
make suggestions for data changes in order to enable a continuous
improvement process of the knowledge base.
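
The type-based filtering can be sketched as a simple subgraph operation; the example below uses networkx as an assumed graph backend with made-up nodes and edges, not the actual Data Explorer implementation.

```python
# Restrict the displayed graph to nodes of type "company" and edges of type
# "is partner of"; graph contents are illustrative.
import networkx as nx

graph = nx.MultiDiGraph()
graph.add_node("Commerzbank AG", type="company")
graph.add_node("Volkswagen AG", type="company")
graph.add_node("Frankfurt", type="location")
graph.add_edge("Commerzbank AG", "Volkswagen AG", type="is partner of")
graph.add_edge("Commerzbank AG", "Frankfurt", type="is located in")


def filter_graph(g, node_type=None, edge_type=None):
    """Return the subgraph induced by the selected node and edge types."""
    nodes = [n for n, data in g.nodes(data=True)
             if node_type is None or data.get("type") == node_type]
    view = nx.MultiDiGraph()
    view.add_nodes_from((n, g.nodes[n]) for n in nodes)
    for u, v, data in g.edges(data=True):
        if u in view and v in view and (edge_type is None or data.get("type") == edge_type):
            view.add_edge(u, v, **data)
    return view


partners = filter_graph(graph, node_type="company", edge_type="is partner of")
print(list(partners.edges(data=True)))
```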
5 CONCLUSIONS
We presented our architecture for an exploratory financial information system based on a custom-built knowledge graph. We further described the components of the processing pipeline used to populate the knowledge base. As a next step, we plan to evaluate the system as a whole and to perform a fine-grained analysis of the results of the individual components in the context of a use case from our financial partner.
REFERENCES
[1] Toni Grütze, Gjergji Kasneci, Zhe Zuo, and Felix Naumann. 2016. CohEEL: Coherent and efficient named entity linking through random walks. Journal of Web Semantics 37–38 (2016), 75–89.
[2] Michael Loster, Zhe Zuo, Felix Naumann, Oliver Maspfuhl, and Dirk Thomas. 2017. Improving Company Recognition from Unstructured Text by using Dictionaries. In Proceedings of the International Conference on Extending Database Technology (EDBT). 610–619.
[3] Alvaro E. Monge and Charles P. Elkan. 1996. The Field Matching Problem: Algorithms and Applications. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD). 267–270.
[4] William E. Winkler and Yves Thibaudeau. 1991. An Application of the Fellegi-Sunter Model of Record Linkage to the 1990 US Decennial Census. US Bureau of the Census (1991), 1–22.
[5] Peng Zhou, Wei Shi, Jun Tian, Zhenyu Qi, Bingchen Li, Hongwei Hao, and Bo Xu. 2016. Attention-Based Bidirectional Long Short-Term Memory Networks for Relation Classification. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL). 207–212.
[6] Zhe Zuo, Michael Loster, Ralf Krestel, and Felix Naumann. 2017. Uncovering Business Relationships: Context-sensitive Relationship Extraction for Difficult Relationship Types. In Proceedings of the Conference on "Lernen, Wissen, Daten, Analysen" (LWDA). 271–283.