Comparing Features for Ranking Relationships
Between Financial Entities based on Text
Tim Repke
Hasso Plattner Institute
Potsdam, Germany
tim.repke@hpi.de
Michael Loster
Hasso Plattner Institute
Potsdam, Germany
michael.loster@hpi.de
Ralf Krestel
Hasso Plattner Institute
Potsdam, Germany
ralf.krestel@hpi.de
1 INTRODUCTION
Evaluating the credibility of a company is an important and complex
task for financial experts. When estimating the risk associated
with a potential asset, analysts rely on large amounts of data from
a variety of different sources, such as newspapers, stock market
trends, and bank statements. Finding relevant information in mostly
unstructured data is a tedious task, and examining all sources by
hand quickly becomes infeasible.
An important aspect of risk management is a company’s relations
to other financial entities. Automatically extracting such
relationships from unstructured text files, such as 10-K filings,
significantly reduces the amount of manual work.
Such structured knowledge enables experts to quickly gain insight
into a company’s relationship network. However, not all extracted
relationships may be important in a given context. In this paper, we
propose an approach to rank extracted relationships based on text
snippets, such that important information can be displayed more
prominently.
2 DATASET
The dataset used for this work was provided in the context of the
FEIII Challenge 2017 [4]. It contains triples extracted from 10-K
and 10-Q filings, each describing a relationship (role) between the
filing company and a mentioned financial entity. Text snippets of three
sentences provide the context a relation appeared in. Relationships
are limited to ten predefined roles (see Table 1). Judging from their
respective text snippets, triples were labelled by experts according
to their relevance from a business perspective as irrelevant, neutral,
relevant, or highly relevant. There are 975 training samples from 25
10-K filings, and 900 triples for testing from 25 disjoint filings.
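For illustration, one labelled training instance can be thought of as the following structure (a minimal sketch in Python; the field names and the numeric encoding of the relevance labels are our own assumptions, not prescribed by the challenge format):

from dataclasses import dataclass

@dataclass
class LabelledTriple:
    filing_company: str    # company that submitted the 10-K/10-Q filing
    role: str              # one of the ten predefined roles, e.g. "trustee"
    mentioned_entity: str  # financial entity mentioned in the filing
    snippet: str           # three-sentence context the relation appeared in
    relevance: int         # 0=irrelevant, 1=neutral, 2=relevant, 3=highly relevant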
Task Description. The challenge aims to explore methods that
automatically produce a ranking of triples with the same role by
relevance. This complements last year’s challenge to identify financial
entities in free text [1].
Inter Annotator Agreement. The inter-annotator agreement,
measured by Cohen’s Kappa [2], has a weighted average of κ = 0.45,
which indicates a high level of disagreement. About 40% of training
triples were rated by more than one expert; in the test set, all triples
received at least three ratings.
3 OUR APPROACH
We rank the snippets for each role based on multi-class classifiers,
which are trained on the experts’ labels. As input to a classifier we
compare three feature sets to represent snippets, namely
bag-of-words (BOW), embeddings (EMB), and syntax features (SYN).
We use ensembles of four one-versus-rest Support Vector Classifiers
(SVC) with sigmoid kernel, as well as random forests with
20 trees to classify snippets. The confidence scores in an ensemble
are normalised using the softmax function. From that we derive
the ranking score as the maximum probability weighted by its
corresponding label. Class imbalance is adjusted for by weights.
In our experiments, the SVC model has proven to be a good
choice for BOW and EMB, but has shown unsatisfactory performance
on syntax features. We therefore chose random forests,
which perform much better in this case.
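The following sketch illustrates this setup, assuming scikit-learn; hyperparameters beyond those stated above and the mapping of the four relevance labels to the numeric weights 0 to 3 are our assumptions, not taken from the original implementation.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# One-vs-rest SVC with sigmoid kernel (used for BOW and EMB features) and a
# random forest with 20 trees (used for SYN features); class weights
# compensate for the label imbalance.
svc = OneVsRestClassifier(SVC(kernel="sigmoid", class_weight="balanced"))
forest = RandomForestClassifier(n_estimators=20, class_weight="balanced")

def softmax(scores):
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def ranking_scores(confidences, label_values=(0, 1, 2, 3)):
    """Normalise the per-class confidences and weight the maximum
    probability by its corresponding relevance label."""
    probs = softmax(confidences)
    best = probs.argmax(axis=1)
    return probs.max(axis=1) * np.asarray(label_values)[best]

# After fitting, e.g.: scores = ranking_scores(svc.decision_function(X_test))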
Although a model could learn role-specific characteristics and
improve its performance, we found that, due to the limited number
of training samples, better results are achieved by training one
model on all samples, disregarding the role.
3.1 Bag-of-Words
Our first model uses a simple bag-of-words representation of the
snippets to classify them. N-grams are extracted for n = 1 to 3 and
are weighted based on the information gain between the classes. In
order to reduce the feature space and guard against overfitting,
the most and least frequent terms are removed from the index.
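A sketch of such a pipeline, assuming scikit-learn, could look as follows; mutual information serves as a stand-in for the information-gain weighting, and the frequency cut-offs and number of selected n-grams are illustrative values rather than the ones used in our experiments.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.pipeline import Pipeline

bow_pipeline = Pipeline([
    # 1- to 3-grams; drop the most and least frequent terms
    ("vectorise", CountVectorizer(ngram_range=(1, 3), min_df=2, max_df=0.9)),
    # keep the n-grams that are most informative about the relevance classes
    ("select", SelectKBest(mutual_info_classif, k=1000)),
])
# features = bow_pipeline.fit_transform(snippets, labels)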
3.2 Sentence Embeddings
Difficulties with previously unseen examples might arise from the
limited training set size. Word embeddings can alleviate this problem by
representing words in a 50- to 300-dimensional vector space. These
representations are learned using unsupervised deep learning:
internally, a neural network is trained to predict the following word
in a sequence of words based on the word’s context window.
We learned paragraph embeddings¹ from the 25 original full-text
10-K filing documents, containing 60k sentences (2m words).
Previous research has shown that such embeddings manage to
outperform BOW approaches [3]. We use a window size of 10 and a
paragraph vector of size 50, which is trained for 10 epochs over the
sentences in all filings.
¹ Using Gensim: https://radimrehurek.com/gensim/models/doc2vec.html
Table 1: Averaged experimental results for each role using BOW+EMB+SYN

                        affiliate  agent  counterpart  guarantor  insurer  issuer  seller  servicer  trustee  underwriter
# samples (train/eval)  185        61     64           34         19       129     20      21        420      21
# samples (test)        129        40     108          28         47       98      49      57        304      40
NDCG (5-fold-cv)        0.93       0.92   0.93         0.93       0.97     0.89    0.91    0.91      0.97     0.94
Baseline (random)       0.89       0.87   0.88         0.89       0.92     0.83    0.84    0.88      0.92     0.89
Table 2: Experimental results for bag-of-words (BOW), embedding (EMB), syntax (SYN) features, and ensemble

Approach           NDCG  σ(NDCG)  F1-Score  σ(F1)
Baseline (random)  0.88  0.03     -         -
Baseline (worst)   0.72  0.06     -         -
BOW                0.88  0.05     0.34      0.13
EMB                0.89  0.04     0.24      0.18
SYN                0.94  0.04     0.44      0.11
BOW+EMB+SYN        0.95  0.04     0.43      0.12
To build the EMB representation for the text snippet associated
with a triple, the embedding is used to induce a vector for each of
the three sentences in the snippet, which are then concatenated.
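A minimal sketch of this procedure, assuming Gensim’s Doc2Vec API and a simplified whitespace tokenisation (the min_count value is an assumption):

import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

sentences = [...]  # placeholder: all sentences from the 25 full-text 10-K filings
corpus = [TaggedDocument(words=s.lower().split(), tags=[i])
          for i, s in enumerate(sentences)]

# window size 10, paragraph vectors of size 50, trained for 10 epochs
model = Doc2Vec(corpus, vector_size=50, window=10, epochs=10, min_count=2)

def snippet_vector(model, snippet_sentences):
    """Concatenate the inferred vectors of the three snippet sentences."""
    return np.concatenate([model.infer_vector(s.lower().split())
                           for s in snippet_sentences])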
3.3 Syntax Features
Additionally, to provide a language-independent approach, we
created a set of syntax-based features. According to the Gini impurity
metric, features such as the ratio of upper-case words and numbers,
or the number of dollar signs and word repetitions, appear
to be the most meaningful for classification. In total we chose 20
features describing the amount or presence of different syntactical
characteristics.
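The sketch below reproduces a few of these features; the complete set of 20 and their exact definitions are not listed here, so the implementations are our own illustrative approximations.

from collections import Counter

def syntax_features(snippet: str) -> list:
    tokens = snippet.split()
    n = max(len(tokens), 1)
    counts = Counter(t.lower() for t in tokens)
    return [
        sum(t.isupper() for t in tokens) / n,                  # ratio of upper-case words
        sum(any(c.isdigit() for c in t) for t in tokens) / n,  # ratio of tokens containing numbers
        snippet.count("$"),                                    # number of dollar signs
        sum(c - 1 for c in counts.values() if c > 1),          # number of word repetitions
    ]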
3.4 Ensemble
Each of the numerical representations and their resulting models
has individual strengths and weaknesses. For example, the language
independence of SYN can tolerate a changing vocabulary to
a certain extent, but lacks the ability to identify key phrases
which may prove useful for classification. Consequently, we combined
the three models by summing the individual predictions to
form a soft vote.
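A sketch of the soft vote, assuming each model outputs softmax-normalised class probabilities of the same shape (samples × classes) and the same label weighting as above:

import numpy as np

def ensemble_ranking_scores(prob_bow, prob_emb, prob_syn, label_values=(0, 1, 2, 3)):
    """Sum the per-class probabilities of the three models and derive the
    ranking score from the combined vote."""
    combined = prob_bow + prob_emb + prob_syn
    best = combined.argmax(axis=1)
    return combined.max(axis=1) * np.asarray(label_values)[best]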
4 EVALUATION
The system’s performance is measured by normalised discounted
cumulative gain (NDCG) [2]. We perform 5-fold cross-validation
(5-fold-cv) by leaving out training triples based on the documents they
were extracted from. Those triples form the evaluation set (eval).
Table 2 lists the mean NDCG scores and the standard deviation
(σ), which are calculated for each role’s ranking as shown for the
ensemble in Table 1. For comparison, we consider a baseline of the
worst possible ranking (inverse order of the ideal ranking) and the
average of multiple random rankings.
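For reference, the NDCG of a single role’s ranking can be computed as sketched below; the exponential gain is one common formulation and an assumption on our side, as the exact gain function is not restated here.

import numpy as np

def dcg(relevances):
    relevances = np.asarray(relevances, dtype=float)
    discounts = np.log2(np.arange(2, len(relevances) + 2))
    return np.sum((2 ** relevances - 1) / discounts)

def ndcg(true_relevance, predicted_scores):
    order = np.argsort(predicted_scores)[::-1]   # sort triples by predicted score
    ideal = np.sort(true_relevance)[::-1]        # ideal ordering by true relevance
    return dcg(np.asarray(true_relevance)[order]) / dcg(ideal)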
The BOW model performs best on evaluation data (NDCG of 0.98),
but the feature selection yields seemingly very specific terms which
are likely to negatively affect the model’s ability to classify unseen
samples, as confirmed by the test data. Training a model on
text usually requires a reasonably large corpus, which we hoped to
counteract by using embeddings based on 10-K filings. However,
even the EMB model barely beats the baseline (NDCG of 0.92 on
eval). The best and most stable results are achieved with the SYN
model and the ensemble, which perform the same on the evaluation
data as on the test data.
Looking at the performance of the classification task itself (measured
by the F1-score), we observe stable results for the ensemble
with low deviation on evaluation data, which is supported by the
same scores on test data. In contrast, the BOW model
showed very high deviations and a significant drop from its F1-score
of 0.73 on evaluation data when applied to the test data.
5 CONCLUSION
Overall, we managed to achieve good NDCG scores of around
0.95 using an ensemble of models. We have shown that BOW is
very sensitive to the change in vocabulary from the 10-K filings used
as training data to the 10-Q filings used as test data. Our assumption
that paragraph embeddings, trained on a significantly larger set of
text and able to reflect phrase similarities, would be more robust to
such changes did not hold. In combination with syntax
features in an ensemble, ranking triples describing the relationship
between financial entities based on text snippets yields the most
stable outcomes.
As this work only focuses on a small textual context, for future
work we are interested in additional external data; e.g., the impact
of a business relationship may be judged by comparing the revenues of
the involved companies. Thus, triples could be enriched by adding
the (historical) revenue of the two involved financial entities.
REFERENCES
[1] 2016. DSMM’16: Proceedings of the Second International Workshop on Data Science
for Macro-Modeling. ACM, New York, NY, USA.
[2] Bruce Croft, Donald Metzler, and Trevor Strohman. 2009. Search Engines: Information
Retrieval in Practice. Pearson.
[3] Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean.
2013. Distributed Representations of Words and Phrases and their Compositionality.
In Conference on Neural Information Processing Systems 2013.
[4] Louiqa Raschid, Doug Burdick, Mark Flood, John Grant, Joe Langsam, Ian Soboroff,
and Elena Zotkina. 2017. Financial Entity Identification and Information Integration
(FEIII) Challenge 2017: The Report of the Organizing Committee. In Proceedings of
the Workshop on Data Science for Macro-Modeling (DSMM@SIGMOD).