
Katharina Morik
- Prof. Dr.
- Professor (Full) at TU Dortmund University
About
- 359 Publications
- 96,356 Reads
- 4,747 Citations
Publications (359)
With machine learning (ML) becoming a popular tool across all domains, practitioners are in dire need of comprehensive reporting on the state-of-the-art. Benchmarks and open databases provide helpful insights for many tasks; however, they suffer from several problems: firstly, they overly focus on prediction quality, which is problematic considering the...
The last decade has witnessed a rapid growth of the field of exoplanet discovery and characterization. However, several big challenges remain, many of which could be addressed using machine learning methodology. For instance, the most prolific method for detecting exoplanets and inferring several of their characteristics, transit photometry, is ver...
The origin of high-energy cosmic rays, atomic nuclei that continuously impact Earth's atmosphere, is unknown. Because of deflection by interstellar magnetic fields, cosmic rays produced within the Milky Way arrive at Earth from random directions. However, cosmic rays interact with matter near their sources and during propagation, which produces hig...
With the ongoing integration of Machine Learning models into everyday life, e.g. in the form of the Internet of Things (IoT), the evaluation of learned models becomes an increasingly important issue. Tree ensembles are one of the best black-box classifiers available and routinely outperform more complex classifiers. While the fast application of t...
Advances in artificial intelligence need to become more resource-aware and sustainable. This requires clear assessment and reporting of energy efficiency trade-offs, like sacrificing fast running time for higher predictive performance. While first methods for investigating efficiency have been proposed, we still lack comprehensive results for popul...
Multivariate Time Series (MTS) involve multiple interdependent time series variables. An MTS has two dimensions: a spatial one, across the different variables composing the MTS, and a temporal one. Both the complex and the time-evolving nature of MTS data make forecasting one of the most challenging tasks in time series analysis. Typical met...
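The abstract cuts off before the proposed method; as a point of reference for the "typical methods" it mentions, here is a minimal sketch of a one-step vector-autoregression baseline that forecasts every variable of an MTS from lagged values of all variables (sizes and data are illustrative; this is not the approach proposed in the paper):

```python
# Baseline MTS forecasting: predict the next step of every variable from
# p lagged values of all variables (vector autoregression via least squares).
import numpy as np

def var_forecast(series, p=3):
    """series: (T, d) multivariate time series; returns a 1-step forecast (d,)."""
    T, d = series.shape
    X = np.hstack([series[p - k - 1:T - k - 1] for k in range(p)])  # lagged inputs
    Y = series[p:]                                                  # targets
    coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
    last = np.hstack([series[T - k - 1] for k in range(p)])         # latest lags
    return last @ coef

rng = np.random.default_rng(0)
ts = np.cumsum(rng.normal(size=(200, 4)), axis=0)  # toy interdependent series
print(var_forecast(ts))
```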
Ensembles are among the state-of-the-art in many machine learning applications. With the ongoing integration of ML models into everyday life, e.g., in the form of the Internet of Things, the deployment and continuous application of models become an increasingly important issue. Therefore, small models that offer good predictive performance and use...
State-of-the-art machine learning (ML) systems show exceptional qualitative performance, but can also have a negative impact on society. With regard to global climate change, the question of resource consumption and sustainability becomes more and more urgent. The enormous energy footprint of single ML applications and experiments was recently inve...
Machine Learning under Resource Constraints addresses, in three volumes, novel machine learning algorithms that are challenged by high-throughput data, by high dimensions, or by complex structures of the data. Resource constraints are given by the relation between the demands for processing the data and the capacity of the computing machinery. The re...
The integration of sensor and simulation data for Machine Learning (ML) has become a hot topic. On the one hand, machine learning in the form of Active Learning (AL) allows running exactly those simulations that are needed for predicting future events based on the analysis of explanatory variables. Since simulation is a time-consuming process, we sa...
Interlinked manufacturing processes are characterized by the dependence of downstream process steps on the previous ones. If it can be predicted that a particular workpiece will not reach the desired quality, anticipatory measures can be taken early in the process. By its prediction, machine learning saves resources, both in processing and in the m...
An enormous amount of data is constantly being produced around the world, both in the form of large volumes and of high velocity. Turning the data into information requires many steps of data analysis: methods for filtering and cleaning the data, joining heterogeneous sources, extracting and selecting features, summarizing and aggregating th...
For the interlinked process of producing steel bars, quality information is given in statistical form for whole charges of blocks, i.e., we know the fraction of blocks that had quality-related problems. This poses a new kind of problem for machine learning, since almost all supervised learning methods assume that labels are given for individual observ...
Parameter sharing is a key technique in various state-of-the-art machine learning approaches. The underlying idea is simple yet effective. Given a highly overparametrized model whose input data obeys some repetitive structure, multiple subsets of parameters are tied together. On the one hand, this reduces the number of parameters, which simplifies...
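The parameter-count effect of tying weights is easy to make concrete; a small sketch comparing an untied (dense) map with a 3x3 convolution that reuses one weight set across all spatial positions (the layer sizes are illustrative):

```python
# Parameter sharing in one number: a 3x3 convolution reuses the same
# k*k*C_in*C_out weights at every spatial position of the input.
H, W, C_in, C_out, k = 32, 32, 16, 16, 3

dense_params = (H * W * C_in) * (H * W * C_out)  # untied: one weight per input/output pair
conv_params = k * k * C_in * C_out               # tied across all positions
print(f"dense: {dense_params:,}  conv: {conv_params:,}  "
      f"reduction: {dense_params // conv_params:,}x")
```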
As processing capabilities increase, more and more data is gathered every day everywhere on earth. While machines are becoming more and more capable of dealing with these large amounts of data, humans cannot keep up with the amount of data generated every day. They need small and comprehensive representative samples of data, which capture all the in...
Machine learning applications have become ubiquitous. Their applications range from embedded control in production machines through process optimization in diverse areas (e.g., traffic, finance, sciences) to direct user interactions like advertising and recommendations. This has led to an increased effort of making machine learning trustworthy. Explai...
Given the increase in publications, searching for relevant papers becomes tedious. In particular, search across disciplines or schools of thinking is not supported. This is mainly due to retrieval with keyword queries: technical terms differ across sciences and over time. Relevant articles might better be identified by their mathemat...
Both the complex and evolving nature of time series data make forecasting one of the most challenging tasks in machine learning. Typical methods for forecasting are designed to model time-evolving dependencies between data observations. However, it is generally accepted that none of them is universally valid for every application. Therefore,...
In manufacturing systems, early quality prediction enables the execution of corrective measures as early as possible in the production chain, thus avoiding costly rework and waste of resources. The increasing development of Smart Factory Sensors and the Industrial Internet of Things has offered wide opportunities for applying data-driven approaches for...
Online learning algorithms have become a ubiquitous tool in the machine learning toolbox and are frequently used in small, resource-constrained environments. Among the most successful online learning methods are Decision Tree (DT) ensembles. DT ensembles provide excellent performance while adapting to changes in the data, but they are not resource e...
Specialized hardware accelerators beyond von Neumann, which offer processing capability where the data resides without moving it, have become inevitable in data-centric computing. Emerging non-volatile memories, like the Ferroelectric Field-Effect Transistor (FeFET), are able to build compact Logic-in-Memory (LiM). In this work, we investigate the probabi...
The performance of machine learning algorithms depends to a large extent on the amount and the quality of data available for training. Simulations are most often used as test-beds for assessing the performance of trained models in a simulated environment before deployment in the real world. They can also be used for data annotation, i.e., assigning labels...
Isolation forest (IF) is a popular outlier detection algorithm that isolates outlier observations from regular observations by building multiple random isolation trees. The average number of comparisons required to isolate a given observation can then be used as a measure of its outlierness. Multiple extensions of this approach have been proposed i...
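The snippet is truncated before the proposed extensions; for orientation, a minimal sketch of the baseline isolation-forest scoring described above, using scikit-learn's IsolationForest (the synthetic data are purely illustrative):

```python
# Minimal sketch of baseline isolation-forest outlier scoring
# (illustrative only; the paper proposes extensions beyond this baseline).
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_regular = rng.normal(0, 1, size=(500, 2))    # dense cluster of regular points
X_outliers = rng.uniform(-6, 6, size=(10, 2))  # sparse, easy-to-isolate points
X = np.vstack([X_regular, X_outliers])

forest = IsolationForest(n_estimators=100, random_state=0).fit(X)
# score_samples returns higher values for inliers; shorter average isolation
# paths (i.e., fewer comparisons) yield lower scores.
scores = forest.score_samples(X)
print("10 most outlying indices:", np.argsort(scores)[:10])
```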
For timing-sensitive edge applications, the demand for efficient lightweight machine learning solutions has increased recently. Tree ensembles are among the state-of-the-art in many machine learning applications. While single decision trees are comparably small, an ensemble of trees can have a significant memory footprint leading to cache locality...
The scientific areas of artificial intelligence and machine learning are rapidly evolving, and their scientific discoveries are drivers of scientific progress in areas ranging from physics or chemistry to life sciences and humanities. But machine learning is facing a reproducibility crisis that is clashing with the core principles of the scientific...
The linkage of machines to cyber-physical systems through information and communication technology (ICT) in the context of Industry 4.0, with the aim of monitoring, controlling, and optimizing complex production systems, enables real-time capable approaches for data acquisition, analysis, and process knowledge generation. In this context, surface mo...
Random Forests (RFs) are among the state-of-the-art in machine learning and offer excellent performance with nearly zero parameter tuning. Remarkably, RFs seem to be impervious to overfitting even though their basic building blocks are well-known to overfit. Recently, a broadly received study argued that a RF exhibits a so-called double-descent cur...
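The baseline behaviour the abstract refers to is easy to reproduce with scikit-learn: how the test error of a Random Forest evolves as the ensemble grows (dataset and ensemble sizes are illustrative; this sketch does not reproduce the double-descent analysis itself):

```python
# Probing overfitting behaviour of Random Forests: test error as individual
# trees (which do overfit) are combined into larger ensembles.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for n_trees in (1, 5, 25, 125, 625):
    rf = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    rf.fit(X_tr, y_tr)
    # test error typically improves, or at least does not degrade, with size
    print(n_trees, round(1 - rf.score(X_te, y_te), 3))
```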
The rapid dynamics of COVID-19 calls for quick and effective tracking of virus transmission chains and early detection of outbreaks, especially in the “phase 2” of the pandemic, when lockdown and other restriction measures are progressively withdrawn, in order to avoid or minimize contagion resurgence. For this purpose, contact-tracing apps are bei...
Random Forests (RF) are among the state-of-the-art in many machine learning applications. With the ongoing integration of ML models into everyday life, the deployment and continuous application of models becomes an increasingly important issue. Hence, small models which offer good predictive performance but use small amounts of memory are required...
We propose a novel explanation method that explains the decisions of a deep neural network by investigating how the intermediate representations at each layer of the deep network were refined during the training process. This way we can a) find the most influential training examples during training and b) analyze which classes attributed most to th...
Data summarization has become a valuable tool in understanding even terabytes of data. Due to their compelling theoretical properties, submodular functions have been the focus of summarization algorithms. Submodular function maximization is a well-studied problem with a variety of algorithms available. These algorithms usually offer worst-case guar...
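For context, a minimal sketch of the standard greedy algorithm for maximizing a monotone submodular function under a cardinality constraint, which attains the classic (1 - 1/e) worst-case guarantee; the log-determinant objective below is one common choice of representativeness score, not necessarily the one used in the paper:

```python
# Greedy maximization of a monotone submodular set function under a
# cardinality constraint k; classic (1 - 1/e) worst-case guarantee.
import numpy as np

def greedy_summary(n_items, k, f):
    selected = []
    for _ in range(k):
        best_gain, best_i = -np.inf, None
        for i in range(n_items):
            if i in selected:
                continue
            gain = f(selected + [i]) - f(selected)  # marginal gain of adding i
            if gain > best_gain:
                best_gain, best_i = gain, i
        selected.append(best_i)
    return selected

# Example objective: log-determinant of a regularized kernel submatrix,
# a common submodular "representativeness" score for vector data.
def make_logdet(X, sigma=1.0):
    K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / (2 * sigma**2))
    def f(S):
        if not S:
            return 0.0
        sub = K[np.ix_(S, S)]
        return np.linalg.slogdet(np.eye(len(S)) + sub)[1]
    return f

X = np.random.default_rng(0).normal(size=(200, 5))
print(greedy_summary(len(X), k=10, f=make_logdet(X)))
```

Practical implementations accelerate this loop with lazy evaluation of marginal gains, exploiting the diminishing-returns property of submodular functions.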
Active class selection provides machine learning practitioners with the freedom to actively choose the class proportions of their training data. While this freedom can improve the model performance and decrease the data acquisition cost, it also puts the practical value of the trained model into question: is this model really appropriate for the cl...
Deep neural networks such as Convolutional Neural Networks (CNNs) have been successfully applied to a wide variety of tasks, including time series forecasting. In this paper, we propose a novel approach for online deep CNN selection using saliency maps in the task of time series forecasting. We start with an arbitrary set of different CNN forecas...
Harnessing the benefits and preventing the harms of AI cannot be achieved through technological fixes and regulation alone. It depends on a complex interplay between technology, societal governance, individual behaviour, and organizational and societal dynamics. Enabling people to understand AI and the consequences of its use and design is a crucial element for...
Gamma hadron classification, a central machine learning task in gamma ray astronomy, is conventionally tackled with supervised learning. However, the supervised approach requires annotated training data to be produced in sophisticated and costly simulations. We propose to instead solve gamma hadron classification with a noisy label approach that on...
Ferroelectric FET (FeFET) is a highly promising emerging non-volatile memory (NVM) technology, especially for binarized neural network (BNN) inference on the low-power edge. The reliability of such devices, however, inherently depends on temperature. Hence, changes in temperature during run time manifest themselves as changes in bit error rates. In...
Continued improvements on existing reconstruction methods are vital to the success of high-energy physics experiments, such as the IceCube Neutrino Observatory. In IceCube, further challenges arise as the detector is situated at the geographic South Pole where computational resources are limited. However, to perform real-time analyses and to issue...
We propose a novel framework for classification using neural networks via an adversarial training procedure, in which we simultaneously train a main classifier, a neural network that solves the original classification task, i.e., classifying instances into two main categories, and two meta-classifiers that act as discriminators and aim to detect fa...
Machine learning applications have become ubiquitous. This has led to an increased effort of making machine learning trustworthy. Explainable and fair AI have already matured. They address knowledgeable users and application engineers. For those who do not want to invest time into understanding the method or the learned model, we offer care labels:...
Data summarizations are a valuable tool to derive knowledge from large data streams and have proven their usefulness in a great number of applications. Summaries can be found by optimizing submodular functions. These functions map subsets of data to real values, which indicate their "representativeness" and which should be maximized to find a diver...
Machine learning applications have become ubiquitous. Their applications range from embedded control in production through process optimization in diverse areas (e.g., traffic, finance, sciences) to direct user interactions like advertising and recommendations. This has led to an increased effort of making machine learning trustworthy. Explainable a...
Ensemble models are widely used as an effective technique in time-series forecasting and have recently leaned toward meta-learning methods due to their proven predictive advantages in combining individual models in an ensemble. However, finding the optimal strategy for ensemble aggregation is an open research question, particularly w...
Modern high-energy astroparticle experiments produce large amounts of data everyday in continuous high-volume streams. The First G-APD Cherenkov Telescope (FACT) aims at detecting particle showers of gamma rays, because cosmic events can be derived from the energy and angle of gamma rays. The separation of gamma rays from background noise, which is...
To overcome the memory wall in neural network (NN) inference systems, recent studies have proposed to use approximate memory, in which the supply voltage and access latency parameters are tuned, for lower energy consumption and faster access at the cost of reliability. To tolerate the occurring bit errors, the state-of-the-art approaches apply bit f...
Modern AI systems have become of widespread use in almost all sectors with a strong impact on our society. However, the very methods on which they rely, based on Machine Learning techniques for processing data to predict outcomes and to make decisions, are opaque, prone to bias and may produce wrong answers. Objective functions optimized in learnin...
To reduce the resource demand of neural network (NN) inference systems, it has been proposed to use approximate memory, in which the supply voltage and the timing parameters are tuned, trading accuracy for energy consumption and performance. Tuning these parameters aggressively leads to bit errors, which can be tolerated by NNs when bit flips are i...
The optimization of submodular functions constitutes a viable way to perform clustering. Strong approximation guarantees and feasible optimization w.r.t. streaming data make this clustering approach favorable. Technically, submodular functions map subsets of data to real values, which indicate how "representative" a specific subset is. Optimal sets...
In the case of a sudden change in the geology in front of the Tunnel Boring Machine (TBM) during mechanized tunneling processes, inappropriate investigation and process adaptation may result in undesirable situations that can induce construction and machine defects. Therefore, subsurface anomaly detection is necessary to trigger an alarm to upda...
Ensemble algorithms offer state-of-the-art performance in many machine learning applications. A common explanation for their excellent performance is the bias-variance decomposition of the mean squared error, which shows that an algorithm's error can be decomposed into its bias and variance. Both quantities are often opposed to each other an...
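For reference, the decomposition the abstract appeals to, written for a regressor \hat{f} trained on a random sample and evaluated at a point x with noisy target y = f(x) + \varepsilon, where E[\varepsilon] = 0 and Var(\varepsilon) = \sigma^2:

```latex
\mathbb{E}\left[(y - \hat{f}(x))^2\right]
  = \underbrace{\left(f(x) - \mathbb{E}[\hat{f}(x)]\right)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\left[\left(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\right)^2\right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```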
The last decade has witnessed a rapid growth of the field of exoplanet discovery and characterisation. However, several big challenges remain, many of which could be addressed using machine learning methodology. For instance, the most prolific method for detecting exoplanets and inferring several of their characteristics, transit photometry, is ver...
Data summarization has become a valuable tool in understanding even terabytes of data. Due to their compelling theoretical properties, submodular functions have been the focus of summarization algorithms. These algorithms offer worst-case approximation guarantees at the expense of higher computation and memory requirements. However, many practi...
This strategic paper highlights selected research initiatives and success stories of technology transfer in the domain of artificial intelligence that have been driven by researchers in North Rhine-Westphalia (NRW). It was inspired by a round table on Artificial Intelligence (AI) initiated by the Ministry of Culture and Science (MKW) of NRW.
Spatio-temporal data sets such as satellite image series are of utmost importance for understanding global developments like climate change or urbanization. However, incompleteness of data can greatly impact usability and knowledge discovery. In fact, there are many cases where not a single data point in the set is fully observed. For filling gaps,...
The communication between data-generating devices is partially responsible for a growing portion of the world's power consumption. Thus, reducing communication is vital from both an economic and an ecological perspective. For machine learning, on-device learning avoids sending raw data, which can reduce communication substantially. Furthermore,...
The use of machine learning methods to inform consequential decisions is increasingly expanding across many fields. As a result, the ability to interpret these models has become increasingly crucial to the acceptance and reliability of the related technologies. In this paper, we propose an active sampling approach for learning accura...
To prevent undesirable effects during milling processes, online predictions of upcoming events can be used. Process simulations make it possible to retrieve additional knowledge about the process, since they allow generating data about characteristics that cannot be measured during the process and can be incorporated as...
The Abstraction and Reasoning Corpus (ARC), comprising image-based logical reasoning tasks, is intended to serve as a benchmark for measuring intelligence. Solving these tasks is very difficult for off-the-shelf ML methods due to their diversity and low amount of training data. We here present our approach, which solves tasks via grammatical evolution...
Recommender engines play a role in the emergence and reinforcement of filter bubbles. When these systems learn that a user prefers content from a particular site, the user will be less likely to be exposed to different sources or opinions and, ultimately, is more likely to develop extremist tendencies. We trace the roots of this phenomenon to the w...
The rapid dynamics of COVID-19 calls for quick and effective tracking of virus transmission chains and early detection of outbreaks, especially in the phase 2 of the pandemic, when lockdown and other restriction measures are progressively withdrawn, in order to avoid or minimize contagion resurgence. For this purpose, contact-tracing apps are being...
Both the complex and evolving nature of time series make forecasting one of the most important and challenging tasks in time series analysis. Typical methods for forecasting are designed to model time-evolving dependencies between data observations. However, it is generally accepted that none of them is universally valid for every appli...
On your search for scientific articles relevant to your research question, you judge the relevance of a mathematical expression that you stumble upon using extensive background knowledge about the domain, its problems and its notations. We wonder if machine learning can support this process and work toward implementing a search engine for mathemati...
Specialized hardware for machine learning allows us to train highly accurate models in hours which would otherwise take days or months of computation time. The advent of recent deep learning techniques can largely be explained by the fact that their training and inference rely heavily on fast matrix algebra that can be accelerated easily via progra...
Setting the optimal hyperparameters of a learning algorithm is a crucial task. Common approaches such as a grid search over the hyperparameter space or randomly sampling hyperparameters require many configurations to be evaluated in order to perform well. Hence, they either yield suboptimal hyperparameter configurations or are expensive in terms of...
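To make the two common approaches named above concrete, a short scikit-learn sketch contrasting an exhaustive grid search with random sampling of hyperparameters (the SVC search space is illustrative and not taken from the paper):

```python
# Grid search evaluates every configuration; random search samples a fixed
# budget of configurations from continuous distributions.
from scipy.stats import loguniform
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10, 100], "gamma": [1e-4, 1e-3, 1e-2]},
                    cv=3)                                   # all 12 configurations
rand = RandomizedSearchCV(SVC(), {"C": loguniform(1e-1, 1e2),
                                  "gamma": loguniform(1e-4, 1e-2)},
                          n_iter=12, cv=3, random_state=0)  # 12 sampled configs

for search in (grid, rand):
    search.fit(X, y)
    print(type(search).__name__, search.best_params_, round(search.best_score_, 3))
```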
Non-volatile memory, such as resistive RAM (RRAM), is an emerging energy-efficient storage, especially for low-power machine learning models on the edge. It is reported, however, that the bit error rate of RRAMs can be up to 3.3% in the ultra low-power setting, which might be crucial for many use cases. Binary neural networks (BNNs), a resource eff...
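A back-of-the-envelope sketch of why a bit error rate of up to 3.3% can be critical: flipping the sign of each binarized weight with that probability and counting how many binarized activations change (layer sizes are illustrative; this is not the protection scheme studied in the paper):

```python
# Simulating RRAM-style bit errors on binarized weights: each +/-1 weight
# flips sign independently with the reported worst-case rate of 3.3%.
import numpy as np

rng = np.random.default_rng(0)
W = np.sign(rng.normal(size=(256, 784)))  # binarized weight matrix (+/-1)
x = rng.normal(size=784)

flips = rng.random(W.shape) < 0.033       # bit error rate up to 3.3%
W_faulty = np.where(flips, -W, W)

clean = np.sign(W @ x)                    # binarized activations
faulty = np.sign(W_faulty @ x)
print("activation disagreement:", np.mean(clean != faulty))
```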
Now that data science receives a lot of attention, the three disciplines of data analysis, databases, and sciences are discussed with respect to the roles they play. In several discussions, I observed misunderstandings of Artificial Intelligence. Hence, it might be the right time to give a personal view of AI and the part of machine learning therei...
Data mining and Machine Learning research has led to a wide variety of training methods and algorithms for different types of models. Many of these methods solve or approximate NP-hard optimization problems at their core, using vastly different approaches, some algebraic, others heuristic. This paper demonstrates another way of solving these probl...
When it comes to clustering nonconvex shapes, two paradigms are used to find the most suitable clustering: minimum cut and maximum density. The most popular algorithms incorporating these paradigms are Spectral Clustering and DBSCAN. Both paradigms have their pros and cons. While minimum cut clusterings are sensitive to noise, density-based cluster...
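The contrast between the two paradigms is easy to reproduce; a minimal sketch running scikit-learn's SpectralClustering (minimum cut) and DBSCAN (maximum density) on a synthetic two-moons dataset (all parameter values are illustrative):

```python
# The two paradigms side by side on a nonconvex two-moons dataset:
# Spectral Clustering (minimum cut) vs. DBSCAN (maximum density).
from sklearn.cluster import DBSCAN, SpectralClustering
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)

cut = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                         random_state=0).fit_predict(X)
density = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)  # label -1 marks noise

print("spectral clusters:", set(cut))
print("dbscan clusters (incl. noise):", set(density))
```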
Boolean matrix factorization (BMF) is a popular and powerful technique for inferring knowledge from data. The mining result is the Boolean product of two matrices, approximating the input dataset. The Boolean product is a disjunction of rank-1 binary matrices, each describing a feature-relation, called pattern, for a group of samples. Yet, there ar...
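To make the notion of a Boolean product concrete, a small NumPy sketch computing the disjunction of rank-1 binary patterns (dimensions and the single noise flip are illustrative):

```python
# Boolean product of two factor matrices: the reconstruction is the
# disjunction (logical OR) of r rank-1 binary patterns.
import numpy as np

rng = np.random.default_rng(0)
n, m, r = 8, 6, 2
U = rng.integers(0, 2, size=(n, r)).astype(bool)  # sample-to-pattern assignments
V = rng.integers(0, 2, size=(r, m)).astype(bool)  # pattern-to-feature relations

# Boolean product: (U o V)[i, j] = OR_k (U[i, k] AND V[k, j])
reconstruction = (U[:, :, None] & V[None, :, :]).any(axis=1)

X = reconstruction.copy()
X[0, 0] ^= True                                   # flip one bit as "noise"
print("reconstruction error:", np.sum(X ^ reconstruction))  # Hamming distance
```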
Given labeled data represented by a binary matrix, we consider the task to derive a Boolean matrix factorization which identifies commonalities and specifications among the classes. While existing works focus on rank-one factorizations which are either specific or common to the classes, we derive class-specific alterations from common factorization...
Mining and exploring databases should provide users with knowledge and new insights. Tiles of data strive to unveil true underlying structure and distinguish valuable information from various kinds of noise. We propose a novel Boolean matrix factorization algorithm to solve the tiling problem, based on recent results from optimization theory. In co...