
Alfred Ultsch
Philipps University of Marburg | PUM · Faculty of Mathematics and Computer Science
University Professor (full C4)
Discover new and useful knowledge in high-dimensional biomedical data: differentially expressed genes and flow cytometry
About
326
Publications
99,474
Reads
6,969
Citations
Introduction
Databionics is my main research interest:
Databionics means the transfer of data processing techniques from nature to computers.
The cooperation of ants, bees, and other swarms produces emergent structures with novel and unforeseen properties. The algorithms motivated by nature create new and useful knowledge from data.
We allow the data to process itself!
Successful applications of the methods are in medicine, meteorology, biology, pharmacy, stock prediction and customer relationship management.
Additional affiliations
February 1998 - February 1999
August 1979 - October 1980
January 1992 - October 2020
Publications (326)
Background: Small sample sizes in biomedical research often lead to poor reproducibility and challenges in translating findings into clinical applications. This problem stems from limited study resources, rare diseases, ethical considerations in animal studies, costly expert diagnosis, and others. As a contribution to solving this problem, we propose a novel...
Background: Clustering on projected data is common in biomedical research analysis. Principal component analysis (PCA) is widely used for projection, focusing on data dispersion (variance), while clustering identifies data concentrations (neighborhood), which are conflicting aims. This study re-evaluates combinations of PCA and other projection met...
Background: Fold change is a common metric in biomedical research for quantifying group differences in omics variables. However, inconsistent calculation methods and inadequate reporting lead to discrepancies in results. This study evaluated various fold-change calculation methods aiming at a recommendation of a preferred approach. Methods: The pri...
Background: Fold change is widely used in biomedical research to quantify the magnitude of group differences in omics variables. However, the exact calculation method is often not reported, leading to inconsistent results. This study re-evaluates different fold-change calculation methods and provides a clear preference. Methods: Data scenarios with...
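The discrepancy described in the two fold-change abstracts above can be illustrated with a small sketch. The truncated abstracts do not say which calculation methods the study compared, so the two variants below (log2 of the ratio of group means, and the difference of the mean log2 values) are assumed here only as common examples, and the data are hypothetical.

```python
import math
import statistics

# Hypothetical expression values for one omics variable in two groups.
group_a = [1.0, 2.0, 4.0, 8.0]
group_b = [1.0, 1.0, 2.0, 2.0]

# Variant 1 (assumed): log2 of the ratio of the arithmetic group means.
log2_fc_of_means = math.log2(statistics.mean(group_a) / statistics.mean(group_b))

# Variant 2 (assumed): difference of the means of the log2-transformed values,
# which equals the log2 ratio of the geometric means.
log2_fc_mean_of_logs = (
    statistics.mean(math.log2(v) for v in group_a)
    - statistics.mean(math.log2(v) for v in group_b)
)

print(f"log2 FC, ratio of means:   {log2_fc_of_means:.3f}")
print(f"log2 FC, mean of log2:     {log2_fc_mean_of_logs:.3f}")
```

On these values the first variant gives log2(2.5), about 1.32, and the second gives 1.0, so a report that only states "fold change" leaves the result ambiguous.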
Diagnostic immunophenotyping of malignant non-Hodgkin-lymphoma (NHL) by multiparameter flow cytometry (MFC) relies on highly trained physicians. Artificial intelligence (AI) systems have been proposed for this diagnostic task, often requiring more learning examples than are usually available. In contrast, Flow XAI has reduced the number of needed l...
State-of-the-art flow cytometry data samples typically consist of measures of 10 to 30 features of more than 100,000 cell “events”. Artificial intelligence (AI) systems are able to diagnose such data with almost the same accuracy as human experts. However, such systems face one central challenge: their decisions have far-reaching consequenc...
Background: Psoriatic arthritis (PsA) is a chronic inflammatory systemic disease whose activity is often assessed using the Disease Activity Score 28 (DAS28-CRP). The present study was designed to investigate the significance of individual components within the score for PsA activity.
Methods: A cohort of 80 PsA patients (44 women and 36 men, aged...
Background: Random walks describe stochastic processes characterized by a sequence of unpredictable changes in a random variable with no correlation to past changes. This report describes a random walk component of a clinical sensory test of olfactory performance. The precise definition of this stochastic process allows the establishment of precise...
Recent advances in mathematical modeling and artificial intelligence have challenged the use of traditional regression analysis in biomedical research. This study examined artificial data sets and biomedical data sets from cancer research using binomial and mu...
Recent advances in mathematical modelling and artificial intelligence have challenged the use of traditional regression analysis in biomedical research. This study examined artificial and cancer research data using binomial and multinomial logistic regression and compared its performance with other machine learning models such as random forests, su...
Background
Psoriatic arthritis (PsA) is a chronic inflammatory systemic disease that is often categorized based on the Disease Activity Score 28 (DAS28-CRP). However, since DAS28-CRP was originally designed for rheumatoid arthritis, it may not perfectly reflect PsA, and periodic re-evaluation has been recommended.
Methods
A cohort of 80 PsA patie...
The importance of appropriate visualisation of raw data in biomedical reports, with a focus on pain.
Selecting the k best features is a common task in machine learning. Typically, a few features have high importance, but many have low importance (right-skewed distribution). This report proposes a numerically precise method to address this skewed feature importance distribution in order to reduce a feature set to the informative minimum of items. C...
Background
Random walks describe stochastic processes that result from a sequence of indeterminate changes in a random variable that are not correlated with past changes. This report describes a random walk component of a clinical sensory test of olfactory performance. The formal description of the stochastic process during the clinical test allows...
Research data obtained during economics or human studies experiments often displays a complex distribution. Even in the two-dimensional case, the statistical identification of subgroups in research data poses an analytical challenge. Here we introduce an interactive R-based tool called “AdaptGauss2D”. It enables a valid identification of a meaningf...
Background
Clustering on projected data is a common component of the analysis of biomedical research datasets. Among projection methods, principal component analysis (PCA) is the most commonly used. It focuses on the dispersion (variance) of the data, whereas clustering attempts to identify concentrations (neighborhoods) within the data. These may...
Background:
The International Prognostic Index (IPI) is applied to predict the outcome of chronic lymphocytic leukemia (CLL) with five prognostic factors, including genetic analysis. We investigated whether multiparameter flow cytometry (MPFC) data of CLL samples could predict the outcome by methods of explainable artificial intelligence (XAI). Fu...
Background
Selecting the k best features is a common task in machine learning. Typically, a few variables have high importance, but many have low importance (right-skewed distribution). This report proposes a numerically precise method to address this skewed feature importance distribution to reduce a feature set to the informative minimum of items...
Feature selection is a common step in data preprocessing that precedes machine learning to reduce data space and the computational cost of processing or obtaining the data. Filtering out uninformative variables is also important for knowledge discovery. By reducing the data space to only those components that are informative to the class structure,...
Bayesian inference is ubiquitous in science and widely used in biomedical research such as cell sorting or "omics" approaches, as well as in machine learning (ML) and artificial neural networks, and "big data" applications. However, the calculation is not robust in regions of low evidence. In...
“Big omics data” provoke the challenge of extracting meaningful information with clinical benefit. Here, we propose a two-step approach, an initial unsupervised inspection of the structure of the high dimensional data followed by supervised analysis of gene expression levels, to reconstruct the surface patterns on different subtypes of acute myeloi...
Motivation: Gaussian mixture models (GMMs) are probabilistic models commonly used in biomedical research to detect subgroup structures in data sets with one-dimensional information. Reliable model parameterization requires that the number of modes, i.e., states of the generating process, is known. However, this is rarely the case for empirically m...
Minimal residual disease (MRD) detection is a strong predictor for survival and relapse in acute myeloid leukemia (AML). MRD can be either determined by molecular assessment strategies or via multiparameter flow cytometry. The degree of bone marrow (BM) dilution with peripheral blood (PB) increases with aspiration volume causing consecutive underes...
Background: The collection of increasing amounts of data in healthcare has become relevant for pain therapy and research. This poses problems for analyses with classical approaches, which is why artificial intelligence (AI) and machine learning (ML) methods are being incorporated into pain research.
Methods: The current literature on AI and ML in the...
Three different Flow Cytometry datasets consisting of diagnostic samples of either peripheral blood (pB) or bone marrow (BM) from patients without any sign of bone marrow disease at two different health care centers are provided. In Flow Cytometry, each cell rapidly passes through a laser beam one by one, and two light scatter, and eight surface pa...
Background: Data transformations are commonly used in bioinformatics data processing in the context of data projection and clustering. The most commonly used metric, the Euclidean distance, is not scale invariant and is therefore occasionally inappropriate for complex, e.g., multimodally distributed variables and may negatively affect the results of cluster analysis. Specifica...
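The statement that the Euclidean metric is not scale invariant can be made concrete with a minimal sketch. The data points and helper names (`euclidean`, `nearest`, `to_cm`) are hypothetical and chosen only for illustration; this is not the transformation proposed in the publication.

```python
import math

def euclidean(a, b):
    """Plain Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(query, points):
    """Index of the point closest to the query."""
    return min(range(len(points)), key=lambda i: euclidean(query, points[i]))

def to_cm(p):
    """Re-express the first variable in centimetres instead of metres."""
    return (p[0] * 100.0, p[1])

# Hypothetical records: (height in m, weight in kg).
query = (1.80, 70.0)
points = [(1.50, 71.0), (1.81, 75.0)]

nearest_m = nearest(query, points)
nearest_cm = nearest(to_cm(query), [to_cm(p) for p in points])

print("nearest neighbour with height in m: ", nearest_m)
print("nearest neighbour with height in cm:", nearest_cm)
```

Changing only the unit of one variable changes which point counts as the nearest neighbour (index 0 in metres, index 1 in centimetres), which is why distance-based projection and clustering results depend on how the variables are scaled or transformed.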
The use of artificial intelligence (AI) systems in biomedical and clinical settings can disrupt the traditional doctor-patient relationship, which is based on trust and transparency in medical advice and therapeutic decisions. When the diagnosis or selection of a therapy is no longer made solely by the physician, but to a significant extent by a ma...
Clustering is an important task in knowledge discovery with the goal to identify structures of similar data points in a dataset. Here, the focus lies on methods that use a human-in-the-loop, i.e., incorporate user decisions into the clustering process through 2D and 3D displays of the structures in the data. Some of these interactive approaches fal...
Motivation
The size of today’s biomedical data sets pushes computer equipment to its limits, even for seemingly standard analysis tasks such as data projection or clustering. Reducing large biomedical data by downsampling is therefore a common early step in data processing, often performed as random uniform class-proportional downsampling. In this...
Typical state-of-the-art flow cytometry data samples consist of measures of more than 100,000 cells in 10 or more features. AI systems are able to diagnose such data with almost the same accuracy as human experts. However, there is one central challenge in such systems: their decisions have far-reaching consequences for the health and life of peop...
Algorithms implementing populations of agents which interact with one another and sense their environment may exhibit emergent behavior such as self-organization and swarm intelligence. Here a swarm system, called Databionic swarm (DBS), is introduced which is able to adapt itself to structures of high-dimensional data characterized by distance and...
Background: Diminished sense of smell impairs the quality of life but olfactorily disabled people are hardly considered in measures of disability inclusion. We aimed to stratify perceptual characteristics and odors according to the extent to which they are perceived differently with reduced sense of smell, as a possible basis for creating olfactory...
Euclidean distance-optimized data transformation for multivariate mining of non-trivial biomedical data (EDO)
The understanding of water quality and its underlying processes is important for the protection of aquatic environments. With the rare opportunity of access to a domain expert, an explainable AI (XAI) framework is proposed that is applicable to multivariate time series. The XAI provides explanations that are interpretable by domain experts. In thre...
The understanding of water quality and its underlying processes is important for the protection of aquatic environments enabling the rare opportunity of access to a domain expert. Hence, an explainable AI (XAI) framework is proposed that is applicable to multivariate time series resulting in explanations that are interpretable by a domain expert. T...
One aim of data mining is the identification of interesting structures in data. For better analytical results, the basic properties of an empirical distribution, such as skewness and eventual clipping, i.e. hard limits in value ranges, need to be assessed. Of particular interest is the question of whether the data originate from one process or cont...
Projections are conventional methods of dimensionality reduction for information visualization used to transform high-dimensional data into low dimensional space. If the projection method restricts the output space to two dimensions, the result is a scatter plot. The goal of this scatter plot is to visualize the relative relationships between high-...
A non-parametric effect-size measure capturing changes in central tendency and data distribution shape.
Please use https://cran.r-project.org/package=ImpactEffectsize for the latest version.
Motivation
Calculating the magnitude of treatment effects or of differences between two groups is a common task in quantitative science. Standard effect size measures based on differences, such as the commonly used Cohen's d, fail to capture the treatment-related effects on the data if the effects were not reflected by the central tendency. The prese...
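The limitation named in this abstract can be shown with a short sketch: two hypothetical groups share the same mean but differ strongly in spread, so the standard pooled-variance form of Cohen's d is zero. The sketch only illustrates the problem; it does not implement the Impact measure, whose definition is not given in the truncated text.

```python
import math
import statistics

def cohens_d(x, y):
    """Cohen's d using the pooled sample standard deviation."""
    nx, ny = len(x), len(y)
    vx, vy = statistics.variance(x), statistics.variance(y)
    pooled_sd = math.sqrt(((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2))
    return (statistics.mean(x) - statistics.mean(y)) / pooled_sd

# Hypothetical measurements: the treatment splits the group into low and
# high responders without moving the mean.
control = [4.0, 5.0, 6.0, 5.0, 4.0, 6.0]
treated = [1.0, 9.0, 1.0, 9.0, 5.0, 5.0]

d = cohens_d(treated, control)
variance_ratio = statistics.variance(treated) / statistics.variance(control)

print(f"Cohen's d:      {d:.3f}")
print(f"variance ratio: {variance_ratio:.1f}")
```

Here d is zero although the treated group's variance is many times that of the control group, which is the kind of distribution-shape effect a purely mean-based measure cannot register.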
The Databionic swarm (DBS) is a flexible and robust clustering framework that consists of three independent modules: swarm-based projection, high-dimensional data visualization, and representation guided clustering. The first module is the parameter-free projection method Pswarm, which exploits concepts of self-organization and emergence, game theo...
Background: Data from biomedical measurements usually include many parameters (variables/features). To reduce efforts of data acquisition or to enhance comprehension, a feature selection method is proposed that combines the ranking of the relative importance of each parameter in random forests classifiers with an item categorization provided by com...
For high-dimensional datasets in which clusters are formed by both distance and density structures (DDS), many clustering algorithms fail to identify these clusters correctly. This is demonstrated for 32 clustering algorithms using a suite of datasets which deliberately pose complex DDS challenges for clustering. In order to improve the structure f...
The Databionic swarm (DBS) is a flexible and robust clustering framework that consists of three independent modules: swarm based projection, high-dimensional data visualization and representation guided clustering. The first module is the parameter-free projection method Pswarm, which exploits concepts of self-organization and emergence, game theor...
Abstract. The understanding of water quality and its underlying processes is important for the protection of aquatic environments. Here an explainable AI (XAI) based multivariate time series analytical framework is applied on high-frequency water quality measurements including nitrate and electrical conductivity and twelve other environmental param...
The Fundamental Clustering Problems Suite (FCPS) offers a variety of clustering challenges that any algorithm should be able to handle given real-world data. The FCPS consists of datasets with known a priori classifications that are to be reproduced by the algorithm. The datasets are intentionally created to be visualized in two or three dimensions...
In the context of data science, data projection and clustering are common procedures. The chosen analysis method is crucial to avoid faulty pattern recognition. It is therefore necessary to know the properties and especially the limitations of projection and clustering algorithms. This report describes a collection of data sets that are grouped tog...
Algorithms implementing populations of agents which interact with one another and sense their environment may exhibit emergent behavior such as self-organization and swarm intelligence. Here a swarm system, called Databionic swarm (DBS), is introduced which is able to adapt itself to structures of high-dimensional data characterized by distance and...
Finding subgroups in biomedical data is a key task in biomedical research and precision medicine. Already one-dimensional data, such as many different readouts from cell experiments, preclinical or human laboratory experiments or clinical signs, often reveal a more complex distribution than a single mode. Gaussian mixtures play an important role in...
Calculating the magnitude of treatment effects or of differences between two groups is a common task in quantitative science. Standard effect size measures based on differences, such as the commonly used Cohen's d, fail to capture the treatment-related effects on the data if the effects were not reflected by the central tendency. “Impact” is a novel...
Advances in flow cytometry enable the acquisition of large and high-dimensional data sets per patient. Novel computational techniques allow the visualization of structures in these data and finally the identification of relevant subgroups. Correct data visualizations and projections from the high-dimensional space to the visualization plane require...
Type Package Title Calculation and Visualization of the Impact Effect Size Measure Description A non-parametric effect size measure capturing changes in central tendency or shape of data distributions for feature selection preceding machine-learning. The package provides the necessary functions to calculate and plot the Impact effect size measure b...
Background:
Persistent pain extending beyond 6 months after breast cancer surgery when adjuvant therapies have ended is a recognised phenomenon. The evolution of postsurgery pain is therefore of interest for future patient management in terms of possible prognoses for distinct groups of patients to enable better patient information.
Objective(s):...
One aim of data mining is the identification of interesting structures in data. The basic properties of an empirical distribution, such as skewness and eventual clipping, i.e., hard limits in value ranges, need to be assessed. Of particular interest is the question of whether the data originate from one process or contain subsets related to differe...
This paper presents a systematic approach for discovering comprehensible, valid, potentially innovative and useful structures in multivariate municipality data. Techniques from statistics, machine learning and data mining are applied in logical consecutive steps. This allows the validation after each step and the generation of important results dur...
Fits Gaussian mixtures by applying evolution. As fitness function a mixture of the chi square test for distributions and a novel measure for approximating the common area under curves between multiple Gaussians is used. The package presents an alternative to the commonly used likelihood maximisation as is used in Expectation maximisation.
An interactive tool to optimize the parameters of a GMM, called “AdaptGauss”, was realized using the freely available R software package (version 3.2.0 for Windows / version 3.2.1 for Linux; http://CRAN.R-project.org/). The newly developed R library “AdaptGauss” is freely available at https://cran.r-project.org/web/packages/AdaptGauss/index.html. Fo...
Motivation:
The genetic architecture of diseases becomes increasingly known. This raises difficulties in picking suitable targets for further research among an increasing number of candidates. Although expression based methods of gene set reduction are applied to laboratory-derived genetic data, the analysis of topical sets of genes gathered from...
Based on increasing evidence suggesting that MS pathology involves alterations in bioactive lipid metabolism, the present analysis was aimed at generating a complex serum lipid-biomarker. Using unsupervised machine-learning, implemented as emergent self-organizing maps of neuronal networks, swarm intelligence and Minimum Curvilinear Embedding, a...
Differential induction therapy of all subtypes of Acute Myeloid Leukemia other than Acute Promyelocytic Leukemia is impeded by the long time required to complete complex and diverse cytogenetic and molecular genetic analyses for risk stratification or targeted treatment decisions. Here, we describe a reliable, rapid and sensitive diagnostic approac...
Background: Prevention of persistent pain following breast cancer surgery, via early identification of patients at high risk, is a clinical need. Supervised machine-learning was used to identify parameters that predict persistence of significant pain.
Methods: Over 500 demographic, clinical and psychological parameters were acquired up to 6 months...
Human activities modify the global nitrogen cycle, mainly through farming. These practices have unintended consequences; for example, nitrate lost from terrestrial runoff to streams and estuaries can impact aquatic life [Aubert et al., 2016]. A greater understanding of water quality variations can improve the evaluation of the state of water bodies...
Projections are conventional methods of dimensionality reduction for information visualization used to transform high-dimensional data into low dimensional space [1]. If the output space is restricted in the projection method to two dimensions, the result is a scatter plot. The goal of this scatter plot is a visualization of distance and density-ba...
Background: Data from biomedical measurements usually include many parameters (variables / features). To reduce efforts of data acquisition or to enhance comprehension, a feature selection method is proposed that combines the ranking of the relative importance of each parameter in random forests classifiers with an item categorization provided by c...
Background
Human genetic research has implicated functional variants of more than one hundred genes in the modulation of persisting pain. Artificial intelligence and machine learning techniques may combine this knowledge with results of genetic research gathered in any context, which permits the identification of the key biological processes involv...
The methods and possibilities of data mining for knowledge discovery in economic data are demonstrated on data of the German system of allocating tax revenues to municipalities. This system is complex and not easily understandable due to the involvement of several layers of administration and legislation. The general aim of the system is that a sha...