Comparing Features for Ranking Relationships
Between Financial Entities based on Text
Tim Repke, Hasso Plattner Institute, Potsdam, Germany, tim.repke@hpi.de
Michael Loster, Hasso Plattner Institute, Potsdam, Germany, michael.loster@hpi.de
Ralf Krestel, Hasso Plattner Institute, Potsdam, Germany, ralf.krestel@hpi.de
1 INTRODUCTION
Evaluating the credibility of a company is an important and complex task for financial experts. When estimating the risk associated with a potential asset, analysts rely on large amounts of data from a variety of different sources, such as newspapers, stock market trends, and bank statements. Finding relevant information in mostly unstructured data is a tedious task, and examining all sources by hand quickly becomes infeasible.
An important aspect of risk management is the relation of a company of interest to other financial entities. Automatically extracting such relationships from unstructured text files, such as 10-K filings, significantly reduces the amount of manual work. Such structured knowledge enables experts to quickly gain insight into a company’s relationship network. However, not all extracted relationships may be important in a given context. In this paper, we propose an approach to rank extracted relationships based on text snippets, such that important information can be displayed more prominently.
2 DATASET
The dataset used for this work was provided in the context of the FEIII Challenge 2017 [4]. It contains triples extracted from 10-K and 10-Q filings, each describing a relationship (role) between the filing company and a mentioned financial entity. Text snippets of three sentences provide the context a relation appeared in. Relationships are limited to ten predefined roles (see Table 1). Judging from their respective text snippets, triples were labelled by experts according to their relevance from a business perspective as irrelevant, neutral, relevant, or highly relevant. There are 975 training samples from 25 10-K filings, and 900 triples for testing from 25 disjoint filings.
Task Description. The challenge aims to explore methods that automatically produce a ranking of triples with the same role by relevance. This complements last year’s challenge to identify financial entities in free text [1].
Inter-Annotator Agreement. The inter-annotator agreement, measured by Cohen’s Kappa [2], has a weighted average of κ = 0.45, which indicates a high level of disagreement. About 40% of the training
triples were rated by more than one expert; in the test set, all triples received at least three ratings.
3 OUR APPROACH
We rank the snippets for each role based on multi-class classifiers, which are trained on the experts’ labels. As input to a classifier we compare three feature sets to represent snippets, namely bag-of-words (BOW), embeddings (EMB), and syntax features (SYN).
We use ensembles of four one-versus-rest Support Vector Classifiers (SVC) with sigmoid kernel, as well as random forests with 20 trees to classify snippets. The confidence scores in an ensemble are normalised using the softmax function. From that we derive the ranking score as the maximum probability weighted by its corresponding label. Class imbalance is adjusted for by class weights.
In our experiments, the SVC model has proven to be a good choice for BOW and EMB, but has shown unsatisfactory performance on syntax features. For those we therefore chose random forests, which perform much better.
Although a model could learn role-specific characteristics and improve its performance, we found that, due to the limited number of training samples, better results are achieved by training one model on all samples, disregarding the role.
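The following sketch illustrates the classifier setup and the derivation of the ranking score in Python with scikit-learn; the label encoding (0 to 3) and variable names are assumptions for illustration, not part of the original implementation:

import numpy as np
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier

LABELS = np.array([0, 1, 2, 3])  # irrelevant ... highly relevant (assumed encoding)

def softmax(scores):
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# one-vs-rest ensemble of four SVCs with sigmoid kernel;
# class weights compensate for the label imbalance
model = OneVsRestClassifier(SVC(kernel="sigmoid", class_weight="balanced"))

def ranking_scores(X_train, y_train, X_test):
    model.fit(X_train, y_train)
    probs = softmax(model.decision_function(X_test))  # normalise confidences
    best = probs.argmax(axis=1)
    # ranking score: maximum probability weighted by its corresponding label
    return probs[np.arange(probs.shape[0]), best] * LABELS[best]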
3.1 Bag-of-Words
Our rst model uses a simple bag-of-words representation of the
snippets to classify them. N-grams are extracted for
n=
1to 3and
are weighted based on information gain between the classes. In
order to reduce the feature space and guard against over-tting,
the most and least frequent terms are removed from the index.
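A minimal sketch of this feature extraction, assuming scikit-learn, could look as follows; the frequency cut-offs and the number of selected n-grams are illustrative, and mutual information stands in for information gain:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.pipeline import make_pipeline

bow_features = make_pipeline(
    # uni- to tri-grams; drop the most and least frequent terms
    CountVectorizer(ngram_range=(1, 3), min_df=2, max_df=0.9),
    # keep only the n-grams that carry the most information about the labels
    SelectKBest(mutual_info_classif, k=1000),
)
# X = bow_features.fit_transform(snippets, labels)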
3.2 Sentence Embeddings
Diculties with previously unseen examples might arise from the
limited training size. Word embeddings can alleviate this problem by
representing words in a 50- to 300-dimensional vector space. These
representations are learned by using unsupervised deep learning.
Internally, a neural network is trained to predict the following word
in a sequence of words based on the word’s context window.
We learned paragraph embeddings¹ from 25 of the original full-text 10-K filing documents, containing 60k sentences (2m words). Previous research has shown that such embeddings can outperform BOW approaches [3]. We use a window size of 10 and a paragraph vector of size 50, which is trained for 10 epochs over the sentences in all filings.
¹ Using Gensim: https://radimrehurek.com/gensim/models/doc2vec.html
Table 1: Averaged experimental results for each role using BOW+EMB+SYN

Role                    affiliate  agent  counterpart  guarantor  insurer  issuer  seller  servicer  trustee  underwriter
# samples (train/eval)        185     61           64         34       19     129      20        21      420           21
# samples (test)              129     40          108         28       47      98      49        57      304           40
NDCG (5-fold-cv)             0.93   0.92         0.93       0.93     0.97    0.89    0.91      0.91     0.97         0.94
Baseline (random)            0.89   0.87         0.88       0.89     0.92    0.83    0.84      0.88     0.92         0.89
Table 2: Experimental results for bag-of-words (BOW), embedding (EMB), syntax (SYN) features, and ensemble

Approach           NDCG  σ(NDCG)  F1-Score  σ(F1)
Baseline (random)  0.88     0.03         -      -
Baseline (worst)   0.72     0.06         -      -
BOW                0.88     0.05      0.34   0.13
EMB                0.89     0.04      0.24   0.18
SYN                0.94     0.04      0.44   0.11
BOW+EMB+SYN        0.95     0.04      0.43   0.12
To build the EMB representation for the text snippet associated
with a triple, the embedding is used to induce a vector for each of
the three sentences in the snippet, which are then concatenated.
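A minimal sketch of this step using Gensim’s Doc2Vec (see footnote 1) is shown below; the window size (10), vector size (50), and number of epochs (10) follow the paper, whereas the tokenisation and min_count are illustrative assumptions:

import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# sentences: list of token lists from the 25 full-text 10-K filings
corpus = [TaggedDocument(words=tokens, tags=[i]) for i, tokens in enumerate(sentences)]
model = Doc2Vec(corpus, vector_size=50, window=10, epochs=10, min_count=2)

def snippet_vector(model, snippet_sentences):
    # concatenate the inferred vectors of the three snippet sentences
    return np.concatenate([model.infer_vector(s.lower().split()) for s in snippet_sentences])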
3.3 Syntax Features
Additionally, to provide a language-independent approach, we created a set of syntax-based features. According to the Gini impurity metric, features such as the ratio of upper-case words and numbers, or the number of dollar signs and word repetitions, appear to be most meaningful for classification. In total we chose 20 features describing the amount or presence of different syntactical characteristics.
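The full feature set is not listed here, but a few of the features could be computed along the following lines (illustrative sketch based on the examples given above):

from collections import Counter

def syntax_features(snippet: str):
    tokens = snippet.split()
    counts = Counter(tokens)
    n = max(len(tokens), 1)
    return [
        sum(t.isupper() for t in tokens) / n,                   # ratio of upper-case words
        sum(any(c.isdigit() for c in t) for t in tokens) / n,   # ratio of tokens containing numbers
        snippet.count("$"),                                     # number of dollar signs
        sum(c - 1 for c in counts.values() if c > 1),           # number of word repetitions
    ]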
3.4 Ensemble
Each of the numerical representations and their resulting models has individual strengths and weaknesses. For example, the language independence of SYN can tolerate a changing vocabulary to a certain extent, but forgoes the ability to identify key phrases which may prove useful for classification. Consequently, we combined the three models by summing the individual predictions to form a soft vote.
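A minimal sketch of the soft vote, assuming each model yields softmax-normalised class probabilities of shape (number of snippets, number of classes):

import numpy as np

def soft_vote(p_bow, p_emb, p_syn):
    summed = p_bow + p_emb + p_syn                      # sum the individual predictions
    return summed / summed.sum(axis=1, keepdims=True)   # renormalise to probabilities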
4 EVALUATION
The system’s performance is measured by the normalised discounted cumulative gain (NDCG) [2]. We perform 5-fold cross-validation (5-fold-cv) by leaving out training triples based on the documents they were extracted from. Those held-out triples form the evaluation set (eval).
Table 2 lists the mean NDCG scores and the standard deviation (σ), which are calculated for each role’s ranking as shown for the ensemble in Table 1. For comparison, we consider a baseline of the worst possible ranking (the inverse order of the ideal ranking) and the average of multiple random rankings.
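The per-role NDCG and the worst-case baseline can be computed, for example, with scikit-learn; treating the relevance labels 0 to 3 as gain values is an assumption for illustration:

import numpy as np
from sklearn.metrics import ndcg_score

def role_ndcg(true_labels, ranking_scores):
    # one "query" per role: compare predicted ranking scores against expert labels
    return ndcg_score(np.asarray([true_labels]), np.asarray([ranking_scores]))

def worst_ndcg(true_labels):
    # worst possible ranking: the inverse order of the ideal ranking
    return role_ndcg(true_labels, -np.asarray(true_labels, dtype=float))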
The BOW model performs best on the evaluation data (NDCG of 0.98), but the feature selection picks seemingly very specific terms which are likely to negatively affect the model’s ability to classify unseen samples, which the test data confirms. Training a model on text usually requires a reasonably large corpus, a limitation we hoped to counteract by using embeddings based on 10-K filings. However, even the EMB model barely beats the baseline (NDCG of 0.92 on eval). The best and most stable results are achieved with the SYN model and the ensemble, which perform the same on the evaluation data as on the test data.
Looking at the performance of the classification task itself (measured by the F1-score), we observe stable results for the ensemble with low deviation on the evaluation data, which is supported by the same scores on the test data. In contrast, the BOW model showed very high deviations and a significant drop in F1-score from 0.73 on the evaluation data when moving to the test data.
5 CONCLUSION
Overall, we managed to achieve good NDCG scores of around 0.95 using an ensemble of models. We have shown that BOW is very sensitive to the change in vocabulary from the 10-K filings in the training data to the 10-Q filings in the test data. Our assumption that paragraph embeddings, trained on a significantly larger set of text and able to reflect phrase similarities, would be more robust to such changes did not hold. Combining them with syntax features in an ensemble yields the most stable results for ranking triples that describe relationships between financial entities based on text snippets.
As this work only considers a small textual context, for future work we are interested in additional external data; e.g., the impact of a business relationship may be judged by comparing the revenues of the involved companies. Thus, triples could be enriched with the (historical) revenues of the two involved financial entities.
REFERENCES
[1] 2016. DSMM’16: Proceedings of the Second International Workshop on Data Science for Macro-Modeling. ACM, New York, NY, USA.
[2] Bruce Croft, Donald Metzler, and Trevor Strohman. 2009. Search Engines: Information Retrieval in Practice. Pearson.
[3] Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In Conference on Neural Information Processing Systems 2013.
[4] Louiqa Raschid, Doug Burdick, Mark Flood, John Grant, Joe Langsam, Ian Soboroff, and Elena Zotkina. 2017. Financial Entity Identification and Information Integration (FEIII) Challenge 2017: The Report of the Organizing Committee. In Proceedings of the Workshop on Data Science for Macro-Modeling (DSMM@SIGMOD).