Knowledge Discovery in Quarterly Financial
Data of Stocks Based on the Prime Standard
using a Hybrid of a Swarm with SOM
Michael C. Thrun1
1- University of Marburg, Mathematics and Computer Science
Hans-Meerwein Str., 35032 Marburg, Germany
Abstract. Companies listed in the German Prime Standard have to publish financial
reports every three months, which so far have not been used fully for fundamental
analysis. Through web scraping, an up-to-date high-dimensional dataset of 45 features
of 269 companies was extracted, but finding meaningful cluster structures in a
high-dimensional dataset with a low number of cases is still a challenge in data
science. A hybrid of a swarm with a SOM, called Databionic swarm (DBS), found
meaningful structures in the financial reports. Using the Chord distance, the DBS
algorithm results in a topographic map of high-dimensional structures and a
clustering. Knowledge is acquired from the clustering using CART. The cluster
structures can be explained by simple rules that allow predicting, with a probability
of 70%, which stock prices will fall in the following quarter.
Some of the companies listed on the Frankfurt Stock Exchange comply with a
rigorously defined transparency standard that is higher than the general standard of
stocks traded on the world market. This market segment is called the German Prime
Standard and requires companies to publish their accounting information.
Currently, the Prime Standard contains 324 companies. Usually, fundamental
analysis has the purpose of predicting the market value of a stock. Without applying a
forecasting method, the “fundamental” value of a stock is determined and
compared to the current value. Some empirical studies have used accounting
information to predict the future performance of a firm. Often, fundamental
analysis is part of a larger system with the goal of selecting the right stock (e.g., the
CANSLIM system) using a small number of variables and a larger number of stocks
[5-7]. Thus, “research does not fully exploit the wealth of information contained in
general purpose financial reports but is outside of the primary financial statements”.
In this work, high-dimensional structures are investigated which are characterized
by 45 variables describing the quarterly financial statement, income statement, and
cash flow defined by the German Prime Standard. The information is extracted directly
from the web by applying a custom web scraping algorithm, which allowed the
extraction of data for 269 companies. The dataset was extracted for the first quarter of
2018, and the stock prices were extracted for the second quarter of 2018.
After preprocessing of the data (i.e., handling of missing values, normalization, and
decorrelation), the Databionic swarm (DBS) algorithm is applied to
269 companies, for which 43 variables are used. The algorithm consists of three parts.
First, the high-dimensional data is projected into a two-dimensional space using a
swarm which utilizes game theory, self-organization, and emergence as well as swarm
intelligence. Besides the number of clusters and a single Boolean parameter
describing the type of structure, the DBS does not require any parameters to be set.
Here, the Chord distance (cf. McCune et al., p. 49) is chosen for cluster analysis because
the distribution analysis of distances indicates bimodality (Fig. 1). Statistical testing
with Hartigan’s dip test agrees that the distribution is not unimodal
(p(N=36046, D=0.027821) < 2.2e-16). Bimodality in the distance distribution serves
as an indicator that there are smaller intra-cluster distances and larger inter-cluster
distances, which supports the assumption that a cluster structure exists. Note that, in
general, all financial reports are quite dissimilar because there are no distances
below 0.5 in Fig. 1.
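The distance computation described above can be illustrated with a minimal sketch. It assumes the common definition of the Chord distance as the Euclidean distance between two vectors after each has been scaled to unit length; the paper does not show its implementation, and the function names and toy rows below are hypothetical.

```python
import math
from itertools import combinations


def chord_distance(x, y):
    """Euclidean distance between x and y after scaling both to unit length."""
    norm_x = math.sqrt(sum(v * v for v in x))
    norm_y = math.sqrt(sum(v * v for v in y))
    if norm_x == 0.0 or norm_y == 0.0:
        raise ValueError("Chord distance is undefined for an all-zero vector")
    return math.sqrt(sum((a / norm_x - b / norm_y) ** 2 for a, b in zip(x, y)))


def pairwise_chord_distances(rows):
    """Chord distances for all unordered pairs of rows (one row per company)."""
    return [chord_distance(a, b) for a, b in combinations(rows, 2)]


def number_of_pairs(n_cases):
    """Number of unordered pairs among n_cases rows: n * (n - 1) / 2."""
    return n_cases * (n_cases - 1) // 2


# Hypothetical toy rows, not taken from the dataset.
toy_rows = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
toy_distances = pairwise_chord_distances(toy_rows)
```

Under this definition, 269 companies give 269 * 268 / 2 = 36046 pairwise distances, which matches the N reported for the dip test. The distance lies between 0 and 2 when features can take negative values, which would be consistent with a second mode around 1.6.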
In the second part, a topographic map is generated using a
simplified emergent self-organizing map in order to visualize high-dimensional
structures. The topographic map accounts for projection errors because
two-dimensional similarities in the scatter plot cannot necessarily represent
high-dimensional distances, and common quality measures of dimensionality reduction
methods require prior assumptions about the underlying structures. The
topographic map combines a 3D landscape with hypsometric tints. Hypsometric tints
are surface colors which depict ranges of elevation, intersected with contour lines.
“Blue colors indicate small distances (sea level), green and brown colors indicate
middle distances (small hilly country) and white colors indicate high distances (snow
and ice of high mountains). The valleys and basins indicate clusters and the
watersheds of hills and mountains indicate borderlines of clusters”. In sum, this
visualization is consistent with a 3D landscape for the human eye and enables 3D
printing of a representation of the high-dimensional structures; one therefore has a
haptic grasp of the data structures and sees them intuitively, which enables laypeople
to interpret them.
It was shown that the visualized elevation between two projected points is an
approximation of the input-space distance D(l,j) between the two high-dimensional
points (here, two companies). Voronoi cells around each projected point define the
abstract U-matrix (AU-matrix) and generate a Delaunay graph. For every projected
point, all direct connections are weighted using the input-space distances D(l,j),
because each border between two Voronoi cells defines a height. All possible
weighted Delaunay paths between all points are calculated. Then, the minimum of
all possible path distances between a pair of points in the output space O is
calculated as the shortest path using Dijkstra's algorithm, resulting in a new
high-dimensional distance which defines the distance between each pair of companies
based on their financial accounting. In this case, the connected structure type of DBS
clustering is chosen, where the similarity between two subsets of data points is defined
as the minimum distance between data points in these subsets and the clustering
process is agglomerative: let $S(c_1, c_2)$ be the distance between two clusters
$c_1 \subset I$ and $c_2 \subset I$, and let $D(l,j)$ be the distance between two data
points in the input space $I$; then the connected approach is defined by
$S(c_1, c_2) = \min_{l \in c_1,\, j \in c_2} D(l,j)$.
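The two steps described above, shortest paths on a weighted graph and the minimum distance between two clusters, can be sketched as follows. This is only an illustration under stated assumptions: the graph is a hypothetical toy example rather than a Delaunay graph built from a projection, the edge weights are made-up input-space distances, and the function names are not taken from the DatabionicSwarm package.

```python
import heapq


def shortest_path_distances(graph, source):
    """Dijkstra's algorithm; graph maps node -> {neighbor: edge weight}."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # outdated heap entry, a shorter path was already found
        for neighbor, weight in graph[node].items():
            candidate = d + weight
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return dist


def connected_distance(cluster_a, cluster_b, dist):
    """Minimum distance between any point of cluster_a and any point of cluster_b."""
    return min(dist[l][j] for l in cluster_a for j in cluster_b)


# Hypothetical undirected toy graph; weights stand in for input-space distances.
toy_graph = {
    "a": {"b": 1.0, "c": 2.0},
    "b": {"a": 1.0, "c": 0.5},
    "c": {"a": 2.0, "b": 0.5, "d": 3.0},
    "d": {"c": 3.0},
}
all_pairs = {node: shortest_path_distances(toy_graph, node) for node in toy_graph}
between = connected_distance({"a", "b"}, {"c", "d"}, all_pairs)
```

In the toy graph the direct edge from a to c has weight 2.0, but the path via b is shorter (1.5), so the path distance replaces the direct weight. An agglomerative procedure would repeatedly merge the two clusters with the smallest such minimum distance.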
Through inspecting the topographic map, the central problem of estimating the
number of clusters is solved by counting the number of valleys, and the structure type
can be set. A dendrogram can be shown additionally. The clustering is valid if
mountains do not partition clusters, which are indicated by points of the same color.
The high-dimensional structures of the financial reports of companies are visible in
the topographic map in Fig. 3, where three valleys can be identified. The number of
valleys seen in the topographic map leads to the choice of three clusters. Additionally,
large changes in fusion levels of the ultrametric portion of the Chord distance indicate
the best cut (Fig. 2, left, y-axis). After the clustering process, the heatmap in Fig. 2
indicates that more similar points are inside a cluster (yellow) and more dissimilar
points are outside a cluster (red). The three main clusters computed by the connected
approach of DBS are separated by mountain ranges (Fig. 3) and have average
intra-cluster distances of c1=0.97, c2=1.01, and c3=0.91, which are located in the first
mode of Fig. 1. The points in the topographic map symbolize the companies and are colored
by the clustering. Two outliers can be identified. The heatmap agrees with the
topographic map that the clusters consist of more similar points inside a cluster than
outside the cluster. Applying the CART algorithm to the clustering yields a
simple set of rules (Fig. 5). If “net income from continuing ops” (NIFCO) < -264 and
“operating income or loss” (OIL) < -58.3, then class 2 is assigned. NIFCO is the
after-tax earnings that a business has generated from its operational activities. NIFCO
“is considered to be a prime indicator of the financial health of a firm’s core
activities”. OIL is the difference between revenues and costs generated by
ordinary operations before deducting interest, taxes, et cetera. The extracted
rules for class 2 lead to the hypothesis that the stocks of companies in the second
cluster are overvalued and that their prices will fall in the next quarter. Using stock
prices of Q2/2018, the hypothesis is verified under the assumption that the data of all
companies is available on the same date at the beginning of the respective quarter.
Fig. 4 shows the class-dependent MD-plot of the rate of return of stocks. The
prices on the last trading day of the first quarter were compared against those on the
last trading day of the second quarter of 2018 using relative differences. The rate of
return in class 2 is significantly lower than in class 1: class 1 has a shift equal to
zero according to a Wilcoxon rank sum test (shift not equal to zero, p(N=196, V=9512)=0.86),
whereas class 2 has a shift not equal to zero according to a Wilcoxon rank sum test
(p(N=54, W=6792) < 0.001). Stock prices of eight companies (3% of classes 1 and 2 and the
outliers) could not be extracted from the web.
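The rate of return based on relative differences can be illustrated with a short sketch. It assumes that the relative difference of two positive prices x and y is (y - x) / (0.5 * (x + y)), a common definition with values between -2 and 2; the paper itself does not spell out the formula. The function names and the example returns are hypothetical.

```python
def relative_difference(start_price, end_price):
    """Relative difference of two positive prices, in the range (-2, 2)."""
    if start_price <= 0 or end_price <= 0:
        raise ValueError("prices must be positive")
    return (end_price - start_price) / (0.5 * (start_price + end_price))


def rate_of_return_percent(start_price, end_price):
    """Rate of return in percent, based on the relative difference."""
    return 100.0 * relative_difference(start_price, end_price)


def share_with_positive_return(returns):
    """Fraction of stocks whose rate of return is above zero."""
    return sum(1 for r in returns if r > 0) / len(returns)


# Hypothetical returns for ten stocks, not taken from the dataset.
toy_returns = [4.0, -2.5, -7.1, 1.2, -0.4, -3.3, -9.8, -1.0, 6.5, -5.2]
toy_share = share_with_positive_return(toy_returns)
```

Unlike the simple return (y - x) / x, this measure is symmetric: a move from 100 to 300 gives +1 and a move from 300 to 100 gives -1. In the toy list, three of ten returns are positive, giving a share of 0.3.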
Fig. 1: The mirrored density plot (MD-plot) of the R package ‘DataVisualizations’
on CRAN shows a bimodal distribution of Chord distances with the first
mode around 1 and the second mode around 1.6.
Fig. 2: Dendrogram (left) and heatmap (right) of the distances sorted by the
clustering, using the R package ‘DataVisualizations’ on CRAN. In the
heatmap, the smaller distances within a cluster are shown in yellow and the
larger distances between different clusters in red (cf. Fig. 1). The branches
of the dendrogram are colored by the first three clusters.
Fig. 3: The topographic map can visualize 43-dimensional, distance-based structures.
It shows three valleys: one major cluster with companies represented by
magenta points (N=199) and two smaller clusters with companies in yellow (N=57)
and black (N=10), as well as two outliers (green and red).
Fig. 4: MD-plot of the rate of return in %, calculated with relative differences.
The red line marks a rate of return of zero. Class 2 has 30% of stocks with a
rate of return above zero and differs significantly from Class 1.
Fig. 5: CART shows three distinct rules, where outliers are incorrectly classified: “net
income from continuing ops” (NIFCO >= -264), “operating income or loss”
(OIL < -58.3), and “net income applicable to common shares” (NIACS >= 5.5).
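The class 2 rule stated in the text (NIFCO < -264 and OIL < -58.3) can be written as a simple screening function. This is a minimal sketch: the thresholds are those given in the paper, while the function name, the company names, and the values are hypothetical, and the sketch assumes that inputs are on the same scale as the data on which CART was trained.

```python
# Thresholds of the class 2 rule as stated in the text.
NIFCO_THRESHOLD = -264.0  # "net income from continuing ops"
OIL_THRESHOLD = -58.3     # "operating income or loss"


def is_class_2(nifco, oil):
    """True if both conditions of the class 2 rule hold."""
    return nifco < NIFCO_THRESHOLD and oil < OIL_THRESHOLD


# Hypothetical companies with made-up values, for illustration only.
toy_companies = {
    "Company A": {"nifco": -300.0, "oil": -60.0},
    "Company B": {"nifco": -300.0, "oil": 10.0},
    "Company C": {"nifco": 120.0, "oil": -80.0},
}
class_2_candidates = [
    name
    for name, values in toy_companies.items()
    if is_class_2(values["nifco"], values["oil"])
]
```

Only the first toy company satisfies both conditions. Both comparisons are strict, so a value exactly at a threshold does not fall into class 2.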
The topographic map in Fig. 3 visualizes 43-dimensional, distance-based structures in a
3D landscape. It shows three distinctive valleys, leading to the hypothesis that
the dataset has three clusters. The topographic map in Fig. 3 is noisy because the
sample of data is small but the feature space is large, and there are no small distances
(see Fig. 1). However, the distribution analysis (Fig. 1) and the heatmap (Fig. 2)
indicate high-dimensional, distance-based structures in the data which can be
visualized by the topographic map (Fig. 3). These structures are reproduced in the
cluster analysis and in the dendrogram. The heatmap could also indicate that cluster three
is a subcluster of cluster one, yet the clustering is valid because mountains do not
partition clusters and the intra-cluster distances are smaller than the inter-cluster
distances (Fig. 2). The primary cluster (N=199) does not yield interesting insights
about the data, which is typical for cluster analysis. However, the extracted rules
(Fig. 5) are interesting for the second cluster (N=57), where NIFCO has to be lower
than -264 and OIL lower than -58.3. By understanding these two variables through
the two rules, it can be concluded that the stock prices of these companies will fall in
the next quarter. This prediction can be verified by the significantly higher probability
of a negative rate of return of stock prices of companies in class 2 in Q2/2018
compared to class 1. In sum, 7 out of 10 companies in class 2 lost value on the stock
market during the second quarter. In comparison, the success rate of stock picking by
a hybrid AI system was reported to be 55.19 to 60.69% on average, and experts
had a success rate worse than chance. Thus, the clustering allows data-driven
stock picking with a high chance of success for a short position.
This work presents an example in which the DBS algorithm made it possible to apply
cluster analysis to high-dimensional data where only a low number of cases exists.
Further examples and a comparison to common clustering algorithms as well as to
dimension reduction techniques are presented in Thrun (2018). The clusters can be explained
by simple rules that allow stocks to be selected for the next quarter using simple
thresholds in the data (selling first and buying later). Besides the choice of the number
of clusters and a Boolean parameter describing the type of structure, DBS is
parameter-free and can be downloaded as the R package “DatabionicSwarm” on CRAN.
Gratitude goes to Hamza Tayyab for programming the web scraping algorithm which
extracted the quarterly data.
[1] Prime-Standard. Teilbereich des Amtlichen Marktes und des Geregelten Marktes der Deutschen Börse für
Unternehmen, die besonders hohe Transparenzstandards [18.09.2018]; from: http://deutsche-boerse.com/dbg-
[2] Gelistete Unternehmen in Prime Standard, G.S.u.S., http://www.deutsche-boerse-cash-market.com/dbcm-
de/instrumente-statistiken/statistiken/gelistete-unternehmen. 2018, Deutsche Börse: Frankfurt.
[3] Abad, C., S.A. Thore, and J. Laffarga, Fundamental analysis of stocks by two-stage DEA. Managerial and
Decision Economics, 2004. 25(5): p. 231-241.
[4] O'Neil, W.J., How to make money in stocks. Vol. 10. 1988: McGraw-Hill New York.
[5] Deboeck, G.J. and A. Ultsch, Picking stocks with emergent self-organizing value maps. Neural Network World,
2000. 10(1): p. 203-216.
[6] Ou, J.A. and S.H. Penman, Financial statement analysis and the prediction of stock returns. Journal of Accounting
and Economics, 1989. 11(4): p. 295-329.
[7] Mohanram, P.S., Separating winners from losers among low book-to-market stocks using financial statement
analysis. Review of Accounting Studies, 2005. 10(2-3): p. 133-170.
[8] Richardson, S., I. Tuna, and P. Wysocki, Accounting anomalies and fundamental analysis: A review of recent
research advances. Journal of Accounting and Economics, 2010. 50(2-3): p. 410-454.
[9] Yahoo! Finance. Income statement, Balance Sheet and Cash Flow. 2018 [cited 2018 29.09.2018]; Available from:
[10] Thrun, M.C., Projection Based Clustering through Self-Organization and Swarm Intelligence. 2018, Heidelberg:
[11] McCune, B., J.B. Grace, and D.L. Urban, Analysis of ecological communities, chapter 6. Vol. 28. 2002: MjM
software design Gleneden Beach.
[12] Hartigan, J.A. and P.M. Hartigan, The dip test of unimodality. The Annals of Statistics, 1985. 13(1): p. 70-84.
[13] Ultsch, A. and M.C. Thrun, Credible Visualizations for Planar Projections, in 12th International Workshop on
Self-Organizing Maps and Learning Vector Quantization (WSOM). 2017, IEEE: Nancy, France. p. 1-5.
[14] Dasgupta, S. and A. Gupta, An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures &
Algorithms, 2003. 22(1): p. 60-65.
[15] Thrun, M.C. and A. Ultsch. Investigating Quality measurements of projections for the Evaluation of Distance and
Density-based Structures of High-Dimensional Data. in European Conference on Data Analysis (ECDA).
2018. Paderborn, Germany.
[16] Patterson, T. and N.V. Kelso, Hal Shelton revisited: Designing and producing natural-color maps with satellite
land cover data. Cartographic Perspectives, 2004(47): p. 28-55.
[17] Thrun, M.C., et al., Visualization and 3D Printing of Multivariate Data of Biomarkers, in International Conference
in Central Europe on Computer Graphics, Visualization and Computer Vision (WSCG), 2016: Plzen. p. 7-16.
[18] Lötsch, J. and A. Ultsch. Exploiting the Structures of the U-Matrix. in Advances in Self-Organizing Maps and
Learning Vector Quantization. 2014. Mittweida, Germany: Springer International Publishing.
[19] Dijkstra, E.W., A note on two problems in connexion with graphs. Numerische Mathematik, 1959. 1(1): p. 269-
[20] Breiman, L., et al., Classification and regression trees. 1984: CRC press.
[21] Bragg, S. Net income from continuing operations. 2018 [cited 2018 03.11.2018]; Available from:
[22] Silver, C., et al. Operating Income. 2014 [cited 2018 03.11.2018]; Available from:
[23] Thrun, M.C. and A. Ultsch, Effects of the payout system of income taxes to municipalities in Germany, in 12th
Professor Aleksander Zelias International Conference on Modelling and Forecasting of Socio-Economic
Phenomena, 2018, Foundation of the Cracow University of Economics: Cracow, Poland. p. 533-542.
[24] Ultsch, A., Is Log Ratio a Good Value for Measuring Return in Stock Investments?, in Advances in Data Analysis,
Data Handling and Business Intelligence. 2009, Springer. p. 505-511.
[25] Behnisch, M. and A. Ultsch, Knowledge Discovery in Spatial Planning Data: A Concept for Cluster
Understanding, in Computational Approaches for Urban Environments. 2015, Springer. p. 49-75.
[26] Tsaih, R., Y. Hsu, and C.C. Lai, Forecasting S&P 500 stock index futures with a hybrid AI system. Decision
Support Systems, 1998. 23(2): p. 161-174.
[27] Torngren, G. and H. Montgomery, Worse than chance? Performance and confidence among professionals and
laypeople in the stock market. The Journal of Behavioral Finance, 2004. 5(3): p. 148-153.
[28] Thrun, M.C. and A. Ultsch, Analyzing the Fine Structure of Distributions. Technical Report being submitted, Dept.
of Mathematics and Computer Science, Philipps-University of Marburg, 2019: Marburg: p. 1-22.