# Christian Bauckhage · University of Bonn | Uni Bonn · Institute for Computer Sciences

Christian Bauckhage

Prof. Dr.

Informed machine learning and quantum computing for intelligent data analysis in the natural sciences and finance.

## About

530

Publications

433,246

Reads

9,727

Citations

Introduction

Christian is a professor of computer science at the University of Bonn and lead scientist for machine learning at Fraunhofer IAIS.
His research addresses questions in the broad area of data science. In particular, he works on theory and applications of artificial intelligence, machine learning, and data mining in the natural sciences, social media, and finance. For a couple of years now, he has been looking at quantum computing solutions for intelligent data analysis.

Additional affiliations

October 2008 - present

October 2008 - present

October 2008 - July 2019

## Publications

Publications (530)

In this note, we study k-medoids clustering and show how to implement the algorithm using NumPy. To illustrate the potential and practical use of this lesser-known clustering method, we discuss an application example where we cluster a data set of strings based on bi-gram distances.
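
The alternating scheme behind k-medoids can be sketched as follows. This is a minimal illustration on a precomputed distance matrix, not the code from the note; the function name `k_medoids`, the random initialization, and the toy data are assumptions made for this example.

```python
import numpy as np

def k_medoids(D, k, n_iter=100, seed=0):
    """Cluster n items given an (n, n) distance matrix D (hypothetical helper).

    Alternates between assigning every item to its nearest medoid and
    re-choosing each medoid as the cluster member with the smallest
    total distance to the other members.
    """
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.flatnonzero(labels == j)
            if members.size == 0:
                continue  # keep the old medoid if a cluster runs empty
            within = D[np.ix_(members, members)].sum(axis=1)
            new_medoids[j] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break  # medoids stopped changing
        medoids = new_medoids
    labels = np.argmin(D[:, medoids], axis=1)
    return medoids, labels

# toy data (assumption): two well separated groups on the real line
x = np.array([0.0, 0.1, 0.2, 10.0, 10.1, 10.2])
D = np.abs(x[:, None] - x[None, :])
medoids, labels = k_medoids(D, k=2)
```

Because only pairwise distances enter the computation, the same function would also accept a matrix of string distances, such as the bi-gram distances mentioned in the abstract.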

In this note, we show how to use NumPy mesh-grids and boolean arrays for efficient image processing. As an application example, we compute fractal images that visualize Julia or Mandelbrot sets.
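
A minimal sketch of the mesh-grid and boolean-mask idea for the Mandelbrot set might look like this. The function name `mandelbrot_mask`, the grid resolution, the plane section, and the iteration count are assumptions for illustration, not values from the note.

```python
import numpy as np

def mandelbrot_mask(width=61, height=41, max_iter=50):
    """Boolean image: True where c has not escaped after max_iter steps."""
    # mesh-grid of complex numbers c = x + iy covering part of the plane
    xs = np.linspace(-2.0, 1.0, width)
    ys = np.linspace(-1.0, 1.0, height)
    X, Y = np.meshgrid(xs, ys)
    C = X + 1j * Y
    Z = np.zeros_like(C)
    inside = np.ones(C.shape, dtype=bool)  # points that have not escaped yet
    for _ in range(max_iter):
        # boolean indexing updates only the points that are still bounded
        Z[inside] = Z[inside] ** 2 + C[inside]
        inside &= np.abs(Z) <= 2.0
    return inside

mask = mandelbrot_mask()
```

Updating only the entries selected by the boolean array avoids overflow for points that have already escaped and replaces an explicit double loop over pixels.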

We revisit the idea of relational clustering and look at NumPy code for spectral clustering that allows us to cluster graphs or networks. In addition, our topic in this note provides us with the opportunity to study the use of NetworkX functions.
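
As a rough sketch of spectral clustering into two groups, one can threshold the eigenvector that belongs to the second-smallest eigenvalue of the graph Laplacian (the Fiedler vector). The function name `spectral_bipartition` and the toy graph are assumptions; the example works on a plain adjacency matrix and does not use NetworkX.

```python
import numpy as np

def spectral_bipartition(A):
    """Split a connected graph in two using the sign of the Fiedler vector."""
    L = np.diag(A.sum(axis=1)) - A        # unnormalized graph Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)  # eigenvalues in ascending order
    fiedler = eigvecs[:, 1]               # second-smallest eigenvalue
    return fiedler > 0

# toy graph (assumption): two triangles joined by the single edge (2, 3)
A = np.zeros((6, 6))
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
labels = spectral_bipartition(A)
```

The overall sign of an eigenvector is arbitrary, so only the grouping carries meaning, not which group is labeled `True`.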

Ten days after our last note in this series, worldwide testing for COVID-19 infections has continued and it seems that current case data can be better explained in terms of Gompertz growth functions rather than in terms of logistic growth functions. In this note, we therefore discuss the Gompertz function and its use in mathematical epidemiolog...
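
For orientation, the two growth functions can be compared in a few lines. The parameterizations below are the common textbook forms and the parameter values are purely illustrative assumptions; nothing here is fitted to case data.

```python
import math

def gompertz(t, a, b, c):
    """Gompertz growth: a * exp(-b * exp(-c * t)), with asymptote a."""
    return a * math.exp(-b * math.exp(-c * t))

def logistic(t, a, b, c):
    """Logistic growth: a / (1 + b * exp(-c * t)), with asymptote a."""
    return a / (1.0 + b * math.exp(-c * t))

a, b, c = 1000.0, 5.0, 0.1    # illustrative values only
t_inflect = math.log(b) / c   # inflection point of both curves in this form

g_at_inflection = gompertz(t_inflect, a, b, c)
l_at_inflection = logistic(t_inflect, a, b, c)
```

In these forms the logistic curve passes its inflection point at half the asymptote (a/2), whereas the Gompertz curve does so at a/e, which makes Gompertz growth asymmetric with a longer tail toward saturation.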

Archetypal analysis is an increasingly popular tool for data mining and pattern recognition. In this note, we first discuss how to solve the underlying optimization problem using plain vanilla Frank-Wolfe optimization and then present an efficient NumPy implementation of this approach.
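
The basic Frank-Wolfe step over the probability simplex can be sketched on a simpler stand-in problem: finding the convex combination of given points that is closest to a target. This is not the full archetypal analysis procedure from the note; the function name `frank_wolfe_simplex`, the step size rule 2/(t+2), and the toy data are assumptions.

```python
import numpy as np

def frank_wolfe_simplex(X, y, n_iter=1000):
    """Minimize ||X w - y||^2 subject to w >= 0 and sum(w) = 1."""
    n = X.shape[1]
    w = np.full(n, 1.0 / n)              # start at the simplex barycenter
    for t in range(n_iter):
        grad = 2.0 * X.T @ (X @ w - y)   # gradient of the squared error
        i = np.argmin(grad)              # best simplex vertex e_i
        gamma = 2.0 / (t + 2.0)          # standard diminishing step size
        w *= 1.0 - gamma                 # w <- (1 - gamma) w + gamma e_i
        w[i] += gamma
    return w

# toy data (assumption): columns of X are the corners of a triangle
X = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
y = np.array([0.25, 0.25])               # lies inside the triangle
w = frank_wolfe_simplex(X, y)
```

Each iterate is a convex combination of simplex vertices, so the constraints hold without any projection step.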

Model-agnostic explanation methods for deep learning models are flexible regarding usability and availability. However, because they can only manipulate input to see changes in output, they suffer from weak performance when used with complex model architectures. For models with large inputs as, for instance, in object detection, sampli...

Remote sensing and artificial intelligence are pivotal technologies of precision agriculture nowadays. The efficient retrieval of large-scale field imagery combined with machine learning techniques shows success in various tasks like phenotyping, weeding, cropping, and disease control. This work will introduce a machine learning framework for autom...

Rhizoctonia crown and root rot (RCRR), caused by Rhizoctonia solani, can cause severe yield and quality losses in sugar beet. The most common strategy to control the disease is the development of resistant varieties. In the breeding process, field experiments with artificial inoculation are carried out to evaluate the performance of genotypes and v...

The Rashomon Effect describes the following phenomenon: for a given dataset there may exist many models with equally good performance but with different solution strategies. The Rashomon Effect has implications for Explainable Machine Learning, especially for the comparability of explanations. We provide a unified view on three different comparison...

Ever-larger language models with ever-increasing capabilities are by now well-established text processing tools. Alas, information extraction tasks such as named entity recognition are still largely unaffected by this progress as they are primarily based on the previous generation of encoder-only transformer models. Here, we propose a simple yet ef...

Auditing financial documents is a very tedious and time-consuming process. As of today, it can already be simplified by employing AI-based solutions to recommend relevant text passages from a report for each legal requirement of rigorous accounting standards. However, these methods need to be fine-tuned regularly, and they require abundant annotate...

This note presents and discusses practical implementations of the kind of super simple quantum Bayesian networks we studied earlier. Our code examples are straightforward but assume familiarity with Python, NumPy, and Qiskit.

Continuing our study of quantum computing for probabilistic inference and Bayesian reasoning, we discuss a circuit over two qubits which simultaneously computes all the joint probabilities of all the possible configurations of a system of two Bernoulli variables. We focus on the mathematics behind our circuit and prove its claimed behavior. Since w...

This note reports anecdotal evidence as to the potential of ChatGPT and Bard for AI assisted problem solving in quantum computing. We discuss exemplary interactions with both systems in which they exhibit theorem proving capabilities. While the problem we ask for is simple, it still requires university level mathematics which both language models a...

We discuss the conceptual connection between qubits and Bernoulli random variables and show how to program quantum computers to act as "coin flipping machines". Our practical code examples assume basic familiarity with Python, NumPy, and Qiskit.
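
The qubit-as-coin analogy can be sketched with a plain classical simulation. The example below deliberately avoids Qiskit and uses only the Python standard library; the function names `bernoulli_qubit` and `measure` are assumptions. It relies on the fact that rotating |0> by an angle theta about the y-axis gives the amplitudes cos(theta/2) and sin(theta/2).

```python
import math
import random

def bernoulli_qubit(p):
    """Amplitudes of R_y(theta)|0> with theta chosen so that P(measure 1) = p."""
    theta = 2.0 * math.asin(math.sqrt(p))
    return math.cos(theta / 2.0), math.sin(theta / 2.0)

def measure(amplitudes, shots, seed=0):
    """Simulate repeated measurements; return how often outcome 1 occurred."""
    rng = random.Random(seed)
    p_one = amplitudes[1] ** 2   # Born rule: probability = squared amplitude
    return sum(rng.random() < p_one for _ in range(shots))

amps = bernoulli_qubit(0.3)      # a "coin" that shows 1 with probability 0.3
ones = measure(amps, shots=10_000)
```

On real quantum hardware or in a circuit simulator the sampling step would be replaced by actual measurements of the prepared qubit.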

This note begins a mini series on quantum circuits for probabilistic reasoning and Bayesian inference. Our overall goal is to explore an often overlooked application of quantum computing for AI. Here, however, we merely prepare ourselves for things to come and recall basic concepts in probabilistic reasoning. Among others, we cover sum-and product...

Finding and amending contradictions in a financial report is crucial for the publishing company and its financial auditors. To automate this process, we introduce a novel approach that incorporates informed pre-training into its transformer-based architecture to infuse this model with additional Part-Of-Speech knowledge. Furthermore, we fine-tune t...

We introduce KPI-Check, a novel system that automatically identifies and cross-checks semantically equivalent key performance indicators (KPIs), e.g. "revenue" or "total costs", in real-world German financial reports. It combines a financial named entity and relation extraction module with a BERT-based filtering and text pair classification compone...

Many applications in automated auditing and the analysis and consistency check of financial documents can be formulated in part as the subset sum problem: Given a set of numbers and a target sum, find the subset of numbers that sums up to the target. The problem is NP-hard and classical solving algorithms are therefore not practical to use in many...

We introduce KPI-EDGAR, a novel dataset for Joint Named Entity Recognition and Relation Extraction building on financial reports uploaded to the Electronic Data Gathering, Analysis, and Retrieval (EDGAR) system, where the main objective is to extract Key Performance Indicators (KPIs) from financial documents and link them to their numerical values...

We tackle the problem of aligning pre-trained large language models (LMs) with human preferences. If we view text generation as a sequential decision-making problem, reinforcement learning (RL) appears to be a natural conceptual framework. However, using RL for LM-based generation faces empirical challenges, including training instability due to th...

Label Distribution Learning (LDL) is a technique to stabilize classification and regression problems with ambiguous and/or imbalanced labels. A prototypical use-case of LDL is human age estimation based on profile images. Regarding this regression problem, a so-called Deep Label Distribution Learning (DLDL) method has been developed....

"Leichte Sprache", the German counterpart to Simple English, is a regulated language aiming to facilitate complex written language that would otherwise stay inaccessible to different groups of people. We present a new sentence-aligned monolingual corpus for Simple German -- German. It contains multiple document-aligned sources which we have aligned...

This note discusses a QUBO formulation of the subset sum problem. Cast as a QUBO, this combinatorial problem can be tackled using Hopfield nets. However, as subset sum is NP-complete, we cannot guarantee our Hopfield nets to always discover an optimal solution. Nevertheless, using multiple restarts, our simple NumPy implementation quickly and consi...
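
One common way to cast subset sum as a QUBO is to minimize the squared deviation (s·z − T)² over binary indicator vectors z, which expands to zᵀ(ssᵀ)z − 2T sᵀz + T². The sketch below builds these terms and, as a stand-in for a Hopfield net, finds the minimum by exhaustive search on a tiny instance. Whether the note uses exactly this formulation is an assumption, as are the function names and the toy numbers.

```python
from itertools import product

def subset_sum_qubo(numbers, target):
    """Return (Q, q) such that energy(z) = z^T Q z + q^T z = (s.z - T)^2 - T^2."""
    n = len(numbers)
    Q = [[numbers[i] * numbers[j] for j in range(n)] for i in range(n)]
    q = [-2 * target * numbers[i] for i in range(n)]
    return Q, q

def energy(z, Q, q):
    n = len(z)
    quad = sum(Q[i][j] * z[i] * z[j] for i in range(n) for j in range(n))
    lin = sum(q[i] * z[i] for i in range(n))
    return quad + lin

# toy instance (assumption)
numbers, target = [3, 34, 4, 12, 5, 2], 9
Q, q = subset_sum_qubo(numbers, target)

# brute force over all 2^n binary vectors; only feasible for very small n
best = min(product((0, 1), repeat=len(numbers)), key=lambda z: energy(z, Q, q))
chosen = [s for s, bit in zip(numbers, best) if bit]
```

An exact solution exists precisely when the minimal energy equals −T², which gives a simple check for whatever solver replaces the brute-force search.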

We once again revisit the minimum set cover problem and show that the underlying integer linear program can be rewritten as a quadratic unconstrained binary optimization problem. This can then be cast as an energy minimization problem which we solve by running Hopfield nets. Using multiple restarts, our simple NumPy implementation consistently prod...

We address the general problem of computing word embeddings and discuss a simple yet powerful solution involving intersection string kernels and kernel principal component analysis. We discuss the theory behind kernel PCA for word embeddings and present corresponding Python / NumPy code. Overall, we demonstrate that the whole framework is very easy...

This is the first in a miniseries of notes on kernel methods for language processing. We discuss the idea of measuring n-gram similarities of words by computing intersection string kernels and demonstrate that the Python standard library allows for compact implementations of this idea.
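
A compact standard-library sketch of the idea: count the n-grams of each word and sum the minimum counts of the n-grams they share. The function names and the choice not to pad words with boundary symbols are assumptions made for this example.

```python
from collections import Counter

def ngrams(word, n=2):
    """Multiset of character n-grams of a word."""
    return Counter(word[i:i + n] for i in range(len(word) - n + 1))

def intersection_kernel(s, t, n=2):
    """Sum over shared n-grams of the smaller of the two counts."""
    # Counter & Counter keeps the minimum count of every common key
    return sum((ngrams(s, n) & ngrams(t, n)).values())

k_st = intersection_kernel("banana", "bandana")  # shared bi-grams: ba, an, an, na
k_ss = intersection_kernel("banana", "banana")
```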

We explore the merits of training support vector machines for binary classification by solving systems of ordinary differential equations. We thus adopt a continuous-time perspective on a machine learning problem, which may be of interest for implementations on (re)emerging hardware platforms such as analog or quantum computers.

In preparation for things to come, we discuss the general ideas behind AdaBoost (for binary classifier training) and present efficient NumPy code for boosting pre-trained weak hypotheses.

We present KPI-BERT, a system which employs novel methods of named entity recognition (NER) and relation extraction (RE) to extract and link key performance indicators (KPIs), e.g. "revenue" or "interest expenses", of companies from real-world German financial documents. Specifically, we introduce an end-to-end trainable architecture that is based...

Financial reports are commonplace in the business world, but are long and tedious to produce. These reports mostly consist of tables with written sections describing these tables. Automating the process of creating these reports, even partially, has the potential to save a company time and resources that could be spent on more creative tasks. Some s...

The challenges and risks of deploying deep neural networks (DNNs) in the open world are often overlooked and potentially result in severe outcomes. With our proposed informer approach, we leverage autoencoder-based outlier detectors with their sensitivity to epistemic uncertainty by ensembling multiple detectors each learning a different one-vs-res...

We revisit the minimum set cover problem and formulate it as an integer linear program over binary indicator vectors. Next, we simply adapt our earlier code for greedy set covering to indicator vector representations.

One-vs-Rest (OVR) classification aims to distinguish a single class of interest (COI) from other classes. The concept of novelty detection and robustness to dataset shift becomes crucial in OVR when the scope of the rest class is extended from the classes observed during training to unseen and possibly unrelated classes, a setting referred to as op...

In preparation for things to come, we discuss a plain vanilla Python implementation of "the" greedy approximation algorithm for the set cover problem.
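
The greedy strategy repeatedly picks the subset that covers the most still-uncovered elements. A plain Python sketch under that reading follows; the function name `greedy_set_cover` and the toy instance are assumptions, not the code from the note.

```python
def greedy_set_cover(universe, subsets):
    """Return indices of subsets chosen greedily until the universe is covered."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # pick the subset that covers the most elements still uncovered
        best = max(range(len(subsets)), key=lambda i: len(subsets[i] & uncovered))
        if not subsets[best] & uncovered:
            raise ValueError("subsets do not cover the universe")
        chosen.append(best)
        uncovered -= subsets[best]
    return chosen

# toy instance (assumption)
universe = range(1, 8)
subsets = [{1, 2, 3, 4}, {4, 5, 6}, {6, 7}, {1, 5}, {2, 3, 7}]
cover = greedy_set_cover(universe, subsets)
```

The result is a valid cover but not necessarily a minimum one, since the greedy rule only yields an approximation.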

Background: Unmanned aerial vehicle (UAV)–based image retrieval in modern agriculture enables gathering large amounts of spatially referenced crop image data. In large-scale experiments, however, UAV images suffer from containing a multitudinous amount of crops in a complex canopy architecture. Especially for the observation of temporal effects, thi...

Over the past decade, machine learning revolutionized vision-based quality assessment for which convolutional neural networks (CNNs) have now become the standard. In this paper, we consider a potential next step in this development and describe a quanvolutional neural network (QNN) algorithm that efficiently maps classical image data to quantum sta...

When training data is scarce, the incorporation of additional prior knowledge can assist the learning process. While it is common to initialize neural networks with weights that have been pre-trained on other large data sets, pre-training on more concise forms of knowledge has rather been overlooked. In this paper, we propose a novel informed machi...

Although CT and MRI are standard procedures in cirrhosis diagnosis, differentiation of etiology based on imaging is not established. This proof-of-concept study explores the potential of deep learning (DL) to support imaging-based differentiation of the etiology of liver cirrhosis. This retrospective, monocentric study included 465 patients with co...

Given is a set of images, where all images show views of the same area at different points in time and from different viewpoints. The task is the alignment of all images such that relevant information, e.g., poses, changes, and terrain, can be extracted from the fused image. In this work, we focus on quantum methods for keypoint extraction and feat...

We show that the fundamental tasks of sorting lists and building search trees or heaps can be modeled as quadratic unconstrained binary optimization problems (QUBOs). The idea is to understand these tasks as permutation problems and to devise QUBOs whose solutions represent appropriate permutation matrices. We discuss how to construct such QUBOs an...

The automatization and digitalization of business processes have led to an increase in the need for efficient information extraction from business documents. However, financial and legal documents are often not utilized effectively by text processing or machine learning systems, partly due to the presence of sensitive information in these documents...

Digital twins enable the modeling and simulation of real-world entities (objects, processes or systems), resulting in improvements in the associated value chains. The emerging field of quantum computing holds tremendous promise for evolving this virtualization towards Quantum (Digital) Twins (QDT) and ultimately Quantum Twins (QT). The quantum (dig...

The most damaging foliar disease in sugar beet is Cercospora leaf spot (CLS), caused by Cercospora beticola Sacc. The pathogen is expanding its territory due to climate conditions, generating the need for early and accurate detection to avoid yield losses. In Germany, monitoring and control strategies are based on visual field assessments, with the...

Many applications in automated auditing and the analysis and consistency check of financial documents can be formulated in part as the subset sum problem: Given a set of numbers and a target sum, find the subset of numbers that sums up to the target. The problem is NP-hard and classical solving algorithms are therefore not practical to use in...

UAV-based image retrieval in modern agriculture enables gathering large amounts of spatially referenced crop image data. In large-scale experiments, however, UAV images suffer from containing a multitudinous amount of crops in a complex canopy architecture. Especially for the observation of temporal effects, this complicates the recognition of indi...

Just as user preferences change with time, item reviews also reflect those same preference changes. In a nutshell, if one is to sequentially incorporate review content knowledge into recommender systems, one is naturally led to dynamical models of text. In the present work we leverage the known power of reviews to enhance rating predictions in a wa...

We consider L2 support vector machines for binary classification. These are as robust as other kinds of SVMs but can be trained almost effortlessly. Indeed, having previously derived the corresponding dual training problem, we now show how to solve it using the Frank-Wolfe algorithm. In short, we show that it requires only a few lines of plain vani...

In times of climate change, growing world population, and the resulting scarcity of resources, efficient and economical usage of agricultural land is increasingly important and challenging at the same time. To avoid disadvantages of monocropping for soil and environment, it is advisable to practice intercropping of various plant species whenever po...

In order to avoid disadvantages of monocropping for soil and environment, it is advisable to practice intercropping of various plant species whenever possible. However, intercropping is challenging as it requires a balanced planting schedule due to individual cultivation time frames. Maintaining a continuous harvest reduces logistical costs and rel...

In this note, we introduce some of the common terminology in digital image processing. We also have a very first look at how to work with digital images in Python and discuss how to read them from and write them to disk.

Financial reports are commonplace in the business world, but are long and tedious to produce. These reports mostly consist of tables with written sections describing these tables. Automating the process of creating these reports, even partially, has the potential to save a company time and resources that could be spent on more creative tasks. We imp...

Neural networks have the potential to be extremely powerful for computer vision related tasks, but can be computationally expensive. Classical methods, by comparison, tend to be relatively lightweight, albeit not as powerful. In this paper, we propose a method of combining parts from a classical system, called the Viola-Jones Object Detection Fram...

Pre-symptomatic drought stress prediction is of great relevance in precision plant protection, ultimately helping to meet the challenge of "How to feed a hungry world?". Unfortunately, it also presents unique computational problems in scale and interpretability: it is a temporal, large-scale prediction task, e.g., when monitoring plants over time u...

We demonstrate that Hopfield networks can be used for hard vector quantization. To this end, we first formulate vector quantization as the problem of minimizing the mean discrepancy between kernel density estimates of two data distributions and then express it as a quadratic unconstrained binary optimization problem that can be solved by a Hopfield...

This note demonstrates that Hopfield nets can solve Sudoku puzzles. We discuss how to represent Sudokus in terms of binary vectors and how to express their rules and hints in terms of matrix-vector equations. This allows us to set up energy functions whose global minima encode the solution to a given puzzle. However, as these energy functions typic...

Within only a few years after the launch of video sharing platforms, viral videos have become a pervasive Internet phenomenon. Yet, notwithstanding growing scholarly interest, the suitability of the viral metaphor seems not to have been studied so far. In this paper, we therefore investigate the attention dynamics of viral videos from the point of...

Internet memes are phenomena that rapidly gain popularity or notoriety on the Internet. Often, modifications or spoofs add to the profile of the original idea thus turning it into a phenomenon that transgresses social and cultural boundaries. It is commonly assumed that Internet memes spread virally but scientific evidence as to this assumption is...

Internet memes are a pervasive phenomenon on the social Web. They typically consist of viral catch phrases, images, or videos that spread through instant messaging, (micro) blogs, forums, and social networking sites. Due to their popularity and proliferation, Internet memes attract interest in areas as diverse as marketing, sociology, or computer s...

Objective: To identify non-EEG based signals and algorithms for detection of motor and non-motor seizures in people lying in bed during video-EEG (VEEG) monitoring and to test whether these algorithms work in freely moving people during mobile EEG recordings.
Methods: Data of three groups of adult people with epilepsy (PwE) were analyzed. Group 1 un...

We approach least squares optimization from the point of view of gradient flows. As a practical example, we consider a simple linear regression problem, set up the corresponding differential equation, and show how to solve it using SciPy.
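
For a linear model with data matrix X, targets y, and weights w, the gradient flow view can be written out as follows. The notation is an assumption chosen for this sketch and need not match the note.

```latex
E(\mathbf{w}) = \lVert \mathbf{X}\mathbf{w} - \mathbf{y} \rVert^2,
\qquad
\frac{d\mathbf{w}}{dt} = -\nabla E(\mathbf{w})
  = -2\,\mathbf{X}^\top \left( \mathbf{X}\mathbf{w} - \mathbf{y} \right)
```

The flow comes to rest where the right-hand side vanishes, that is, where the normal equations $\mathbf{X}^\top \mathbf{X}\,\mathbf{w} = \mathbf{X}^\top \mathbf{y}$ hold, so integrating the differential equation numerically for long enough approaches the least squares solution.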

We revisit Hopfield nets for bipartition clustering and tweak the underlying energy function such that it has a unique global minimum. In other words, we show how to remove ambiguity from the bipartition clustering problem. Our corresponding NumPy code is short and simple.

Behavioral game analytics has predominantly been confined to work on single games, which means that the cross-game applicability of current knowledge remains largely unknown. Here four experiments are presented focusing on the relationship between game ownership, time invested in playing games, and the players themselves, across more than 3000 game...

Mobile digital games are dominantly released under the freemium business model, but only a small fraction of the players makes any purchases. The ability to predict who will make a purchase enables optimization of marketing efforts, and tailoring customer relationship management to the specific user's profile. Here this challenge is addressed via t...

We show how max-sum diversification can be used to solve the k-clique problem, a well-known NP-complete problem. This reduction proves that max-sum diversification is NP-hard and provides a simple and practical method to find cliques of a given size using Hopfield networks.

We derive the dual problem of L2 support vector machine training. This involves setting up the Lagrangian of the primal problem and working with the Karush-Kuhn-Tucker conditions. As a payoff, we find that the dual poses a rather simple optimization problem that can be solved by the Frank-Wolfe algorithm.

We revisit Hopfield nets for bipartition clustering and show how to invoke the kernel trick to increase robustness and versatility. Our corresponding NumPy code is short and simple.

We show that Hopfield networks can cluster numerical data into two salient clusters. Our derivation of a corresponding energy function is based on properties of the specific problem of 2-means clustering. Our corresponding NumPy code is short and simple.