About
100 Publications
39,046 Reads
1,859 Citations

Publications (100)
Managing complex AI systems requires insight into a model's decision-making processes. Understanding how these systems arrive at their conclusions is essential for ensuring reliability. In the field of explainable natural language processing, many approaches have been developed and evaluated. However, experimental analysis of explainability for tex...
Background
Psoriasis (Pso) is one of the most common chronic inflammatory skin diseases in Europe. Psoriatic arthritis (PsA) is closely associated with Pso. Up to 30% of Pso patients will develop PsA during the course of their skin disease. Defined and validated approaches for early detection are still missing. Besides biomarkers from blood or imaging, clinica...
The COVID-19 pandemic and the high numbers of infected individuals pose major challenges for public health departments. To overcome these challenges, the health department in Cologne has developed a software called DiKoMa. This software offers the possibility to track contact and index persons, but also provides a digital symptom diary. In this wor...
Methods from explainable machine learning are increasingly applied. However, evaluation of these methods is often anecdotal and not systematic. Prior work has identified properties of explanation quality and we argue that evaluation should be based on them. In this position paper, we provide an evaluation process that follows the idea of property t...
The definitive diagnosis and early treatment of many immune-mediated inflammatory diseases (IMIDs) is hindered by variable and overlapping clinical manifestations. Psoriatic arthritis (PsA), which develops in ~30% of people with psoriasis, is a key example. This mixed-pattern IMID is apparent in entheseal and synovial musculoskeletal structures, bu...
Summary
Background and aims
Even in the early phase of the COVID-19 pandemic, whose course varied greatly around the world, there were indications that socioeconomic factors influenced the spread dynamics of the disease, which, particularly from the second phase onward (September 2020), affected people with lower socioeconomic status more strongly...
Deployment of modern data-driven machine learning methods, most often realized by deep neural networks (DNNs), in safety-critical applications such as health care, industrial plant control, or autonomous driving is highly challenging due to numerous model-inherent shortcomings. These shortcomings are diverse and range from a lack of generalization...
Background
Diagnosis and treatment of PsA and axSpA are often delayed due to the lack of clear diagnostic criteria and limited resources for referral to rheumatologists, including high numbers of incorrect referrals. Primary care is usually provided by general practitioners, dermatologists, or orthopedists. Clinical discriminators with a high...
The use of deep neural networks (DNNs) in safety-critical applications like mobile health and autonomous driving is challenging due to numerous model-inherent shortcomings. These shortcomings are diverse and range from a lack of generalization over insufficient interpretability to problems with malicious inputs. Cyber-physical systems employing DNN...
Statistical models are inherently uncertain. Quantifying or at least upper-bounding their uncertainties is vital for safety-critical systems such as autonomous vehicles. While standard neural networks do not report this information, several approaches exist to integrate uncertainty estimates into them. Assessing the quality of these uncertainty est...
In addition to objective indicators (e.g. laboratory values), clinical data often contain subjective evaluations by experts (e.g. disease severity assessments). While objective indicators are more transparent and robust, the subjective evaluation contains a wealth of expert knowledge and intuition. In this work, we demonstrate the potential of pair...
Visual analytics is a research discipline that is based on acknowledging the power and the necessity of the human vision, understanding, and reasoning in data analysis and problem solving. It develops a methodology of analysis that facilitates human activities by means of interactive visual representations of information. By examples from the domai...
Ageing is associated with a decline in physical activity and a decrease in the ability to perform activities of daily living, affecting physical and mental health. Elderly people or patients could be supported by a human activity recognition (HAR) system that monitors their activity patterns and intervenes in case of change in behavior or a critica...
The performance of modern relation extraction systems is to a great degree dependent on the size and quality of the underlying training corpus and in particular on the labels. Since generating these labels by human annotators is expensive, Distant Supervision has been proposed to automatically align entities in a knowledge base with a text corpus t...
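The distant-supervision idea of aligning a knowledge base with a text corpus can be sketched in a few lines: a sentence receives a relation label whenever it mentions both entities of a known fact. The knowledge-base facts, relation name, and sentences below are invented for illustration.

```python
# Distant-supervision sketch: label a sentence with a relation whenever it
# mentions both entities of a known knowledge-base fact.
# The KB facts and example sentences are illustrative.

kb = {
    ("Berlin", "Germany"): "capital_of",
    ("Paris", "France"): "capital_of",
}

sentences = [
    "Berlin is the largest city of Germany.",
    "Paris hosted the 2024 Olympics.",
    "The river Seine flows through Paris in France.",
]

def distant_labels(sentences, kb):
    """Return (sentence, entity1, entity2, relation) tuples for every
    sentence that mentions both entities of a KB fact."""
    labeled = []
    for s in sentences:
        for (e1, e2), rel in kb.items():
            if e1 in s and e2 in s:
                labeled.append((s, e1, e2, rel))
    return labeled

for example in distant_labels(sentences, kb):
    print(example)
```

Note that both matched sentences get labeled capital_of even though neither actually expresses that relation — exactly the label noise that distant-supervision methods must cope with.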
Due to the advances in the digitalization process of the manufacturing industry and the resulting available data, there is tremendous progress and large interest in integrating machine learning and optimization methods on the shop floor in order to improve production processes. Additionally, a shortage of resources leads to increasing acceptance of...
In medicine and healthcare, ever larger amounts of increasingly diverse data are available and are generated at ever greater speed. This general trend is referred to as Big Data. Analyzing Big Data with machine learning methods leads to the development of innovative solutions that generate new medical insights and...
In air traffic management and control, movement data describing actual and planned flights are used for planning, monitoring and post-operation analysis purposes with the goal of more efficient utilization of airspace capacities (in terms of delay reduction or flight efficiency), without compromising the safety of passengers and cargo, nor ti...
In this chapter, we present the results of our research on utilizing social media analytics for community policing. Social media are a valuable resource for LEAs to monitor happenings in their local communities. A 2013 survey by the International Association of Chiefs of Police in the US showed that 95.9% of surveyed agencies use social media...
Scarcity of labeled data is one of the most frequent problems faced in machine learning. This is particularly true for relation extraction in text mining, where large corpora of texts exist in many application domains, while labeling text data requires an expert to invest much time reading the documents. Overall, state-of-the-art models, like th...
Nowadays social media mining is broadly used in the security sector to support law enforcement and to increase response time in emergency situations. One approach to go beyond the manual inspection is to use text mining technologies to extract latent topics, analyze their geospatial distribution and to identify the sentiment from posts. Although wi...
In the first hours of a disaster, up-to-date information about the area of interest is crucial for effective disaster management. However, due to the delay induced by collecting and analysing satellite imagery, disaster management systems like the Copernicus Emergency Management Service (EMS) are currently not able to provide information products u...
Healthcare is one of the business fields with the highest Big Data potential. According to the prevailing definition, Big Data refers to the fact that data today is often too large and heterogeneous and changes too quickly to be stored, processed, and transformed into value by previous technologies. The technological trends drive Big Data: business...
Biomedical research becomes increasingly multidisciplinary and collaborative in nature. At the same time, it has recently seen a vast growth in publicly and instantly available information. As the available resources become more specialized, there is a growing need for multidisciplinary collaborations between biomedical researchers to address compl...
Real world problems in society, science or economics need human structuring, interpretation and decision making, the limiting factor being the amount of time and effort that the user can invest in the sense-making process. The Dicode data mining services intend to help in clearly defined steps of the sense-making process, where human capacity is mo...
This chapter reports on practical lessons learned while developing the Dicode data mining services and using them in data-intensive and cognitively-complex settings. Various sources were taken into consideration to establish these lessons, including user feedback obtained from evaluation studies, discussions in teams, as well as observation of se...
Usability testing methods are nowadays integrated into the design and development of health-care software, and the need for usability in health-care information technology (IT) is widely accepted by clinicians and researchers. Usability assessment starts with the identification of specific objectives that need to be tested and continues with the de...
This paper outlines the major components and function of the Technologically Integrated Oncosimulator developed primarily within the ACGT (Advancing Clinico Genomic Trials on Cancer) project. The Oncosimulator is defined as an information technology system simulating in vivo tumor response to therapeutic modalities within the clinical trial context...
Clinical decision support (CDS) systems promise to improve the quality of clinical care by helping physicians to make better, more informed decisions efficiently. However, the design and testing of CDS systems for practical medical use is cumbersome. It has been recognized that this may easily lead to a problematic mismatch between the developers'...
Medicine is undergoing a revolution that is even transforming the nature of health care from reactive to proactive. To serve these new and diverse needs, bioinformatics and data mining are teaming up to generate tools and procedures for prediction of disease recurrence and progression, response to treatment, as well as new insights into various onc...
Biomedical research has recently seen a vast growth in publicly and instantly available information, which are often complementary or overlapping. As the available resources become more specialized, there is a growing need for multidisciplinary collaborations between biomedical researchers to address complex research questions. We present an applic...
Supervised local pattern discovery aims to find subsets of a database with a high statistical unusualness in the distribution of a target attribute. Local pattern discovery is often used to generate a human-understandable representation of the most interesting dependencies in a data set. Hence, the more crisp and concise the output is, the better....
Contemporary collaboration settings are often associated with huge, ever-increasing amounts of data, which may vary in terms of relevance, subjectivity and importance. In such settings, collective sense making is crucial for well-informed decision making. This sense making process may both utilize and provide input to intelligent information analys...
The objective of the ACGT (Advancing Clinico-Genomic Trials on Cancer: Open Grid Services for improving Medical Knowledge Discovery, www.eu-acgt.org) project that has recently concluded successfully was the development of a semantically rich infrastructure facilitating seamless and secure access and analysis, of multi-level clinical and genomic dat...
Supervised descriptive rule discovery techniques like subgroup discovery are quite popular in applications like fraud detection or clinical studies. Compared with other descriptive techniques, like classical support/confidence association rules, subgroup discovery has the advantage that it comes up with only the top-k patterns, and that it makes us...
Bioinformatics and data mining procedures are collaborating to implement and evaluate tools and procedures for prediction of disease recurrence and progression, response to treatment, as well as new insights into various oncogenic pathways [1], [2], [3], [4] by taking into account the user needs and their heterogeneity. Based on these advances, med...
Today's business applications demand high flexibility in processing information and extracting knowledge from data. Thus, data mining becomes more and more an integral part of operating a business. However, the integration of data mining into business processes still requires a lot of coordination and manual adjustment. This paper aims at reducing...
The Worldwide innovative Networking in personalized cancer medicine (WIN) initiated by the Institute Gustave Roussy (France) and The University of Texas MD Anderson Cancer Center (USA) has dedicated its 3rd symposium (Paris, 6–8 July 2011) to discussion on gateways to increase the efficacy of cancer diagnostics and therapeutics (http://www.winconso...
This paper reports on a hybrid approach aiming to facilitate and augment collaboration and decision making in data-intensive and cognitively-complex settings. The proposed approach exploits and builds on the most prominent high-performance computing paradigms and large data processing technologies to meaningfully search, analyze and aggregate data...
The challenges regarding seamless integration of distributed, heterogeneous and multilevel data arising in the context of contemporary, post-genomic clinical trials cannot be effectively addressed with current methodologies. An urgent need exists to access data in a uniform manner, to share information among different clinical and research centers,...
Supervised descriptive rule discovery techniques like subgroup discovery are quite popular in applications like fraud detection or clinical studies. Compared with other descriptive techniques, like classical support/confidence association rules, subgroup discovery has the advantage that it comes up with only the top-k patterns, and that it makes u...
Today's business applications demand high flexibility in processing information and extracting knowledge from data. Thus, data mining becomes more and more an integral part of operating a business. However, the integration of data mining into business processes still requires a lot of coordination and manual adjustment. This paper aims at reducing...
This short paper provides a brief outline of the main components and the developmental and translational process of the ACGT Oncosimulator. The Oncosimulator is an integrated software system simulating in vivo tumor response to therapeutic modalities within the clinical trial environment. It aims at supporting patient individualized optimization of...
Integrating data mining into business processes becomes crucial for business today. Modern business process management frameworks provide great support for flexible design, deployment and management of business processes. However, integrating complex data mining services into such frameworks is not trivial due to unclear definitions of user roles and...
A learning problem that has only recently gained attention in the machine learning community is that of learning a classifier from group probabilities. It is a learning task that lies somewhere between the well-known tasks of supervised and unsupervised learning, in the sense that for a set of observations we do not know the labels, but for some gr...
Workflow enacting systems are a popular technology in business and e-science alike to flexibly define and enact complex data processing tasks. Since the construction of a workflow for a specific task can become quite complex, efforts are currently underway to increase the re-use of workflows through the implementation of specialized workflow reposi...
Public healthcare is a basic service provided by governments to citizens which is increasingly coming under pressure as the European population ages and the ratio of working to elderly persons falls. A way to make public spending on healthcare more efficient is to ensure that the money is spent on legitimate causes. This paper presents the work of...
Subgroup discovery is a Knowledge Discovery task that aims at finding subgroups of a population with high generality and distributional unusualness. While several subgroup discovery algorithms have been presented in the past, they focus on databases with nominal attributes or make use of discretization to get rid of the numerical attributes. In thi...
Subgroup discovery is the task of identifying the top k patterns in a database with most significant deviation in the distribution of a target attribute Y. Subgroup discovery is a popular approach for identifying interesting patterns in data, because it combines statistical significance with an understandable representation of patterns as a logical...
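The top-k pattern search described above can be sketched in a few lines. The following is a minimal illustration, assuming a toy dataset with a binary target and using Weighted Relative Accuracy (WRAcc), one common quality function; the attribute names and records are invented for the example, and only single attribute=value conditions are considered.

```python
# Toy subgroup discovery: rank single-condition subgroups by Weighted
# Relative Accuracy (WRAcc), one common quality function.
# Dataset and the restriction to single conditions are illustrative.

def wracc(records, target, condition):
    """WRAcc = coverage * (p_subgroup - p_population) for a binary target."""
    n = len(records)
    p_pop = sum(1 for r in records if r[target]) / n
    sub = [r for r in records if condition(r)]
    if not sub:
        return 0.0
    coverage = len(sub) / n
    p_sub = sum(1 for r in sub if r[target]) / len(sub)
    return coverage * (p_sub - p_pop)

def top_k_subgroups(records, target, attrs, k=3):
    """Exhaustively score all attribute=value conditions, return the k best."""
    scored = []
    for a in attrs:
        for v in {r[a] for r in records}:
            q = wracc(records, target, lambda r, a=a, v=v: r[a] == v)
            scored.append((q, f"{a}={v}"))
    return sorted(scored, reverse=True)[:k]

data = [
    {"smoker": True,  "age": "old",   "disease": True},
    {"smoker": True,  "age": "old",   "disease": True},
    {"smoker": True,  "age": "young", "disease": False},
    {"smoker": False, "age": "old",   "disease": False},
    {"smoker": False, "age": "young", "disease": False},
    {"smoker": False, "age": "young", "disease": False},
]
print(top_k_subgroups(data, "disease", ["smoker", "age"]))
```

Here "smoker=True" reaches the maximum WRAcc of 1/6: it covers half the population with a disease rate of 2/3 versus 1/3 overall.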
This paper describes an approach for selecting instances in regression problems in the cases where observations x are readily available, but obtaining labels y is hard. Given a database of observations, an algorithm inspired by statistical design of experiments and kernel methods is presented that selects a set of k instances to be chosen in order...
Today, most of the data in business applications is stored in relational databases. Relational database systems are so popular, because they offer solutions to many problems around data storage, such as efficiency, effectiveness, usability, security and multi-user support. To benefit from these advantages in Support Vector Machine (SVM) learning, w...
In this paper, we describe an analysis tool based on the statistical environment R, GridR, which allows using the collection of methodologies available as R packages in a grid environment. It provides the user with transparent and seamless access to large-scale distributed computational services and data repositories within the secure and reliable...
Grid technologies have proven to be very successful in the area of eScience, and healthcare in particular, because they allow to easily combine proven solutions for data querying, integration, and analysis into a secure, scalable framework. In order to integrate the services that implement these solutions into a given Grid architecture, some metada...
Grid technologies have proven to be very successful in the area of eScience, and in particular in healthcare applications. But while the applicability of workflow enacting tools for biomedical research has long since been proven, their practical adoption into regular clinical research faces some additional challenges in a grid context. In this paper, we...
The analysis of clinico-genomic data poses complex computational problems. In the project ACGT, a grid-based software system to support clinicians and bio-statisticians in their daily work is being developed. Starting with a detailed user requirements analysis, and with the continuous integration of usability analysis in the development proce...
In this paper, we describe an extension to the ACGT GridR environment which allows the parallelization of loops in R scripts in view of their distributed execution on a computational grid. The ACGT GridR service is extended by a component that uses a set of preprocessor-like directives to organize and distribute calculations. The use of paralleliza...
Subgroup discovery is the task of finding subgroups of a population which exhibit both distributional unusualness and high generality. Due to the non monotonicity of the corresponding evaluation functions, standard pruning techniques cannot be used for subgroup discovery, requiring the use of optimistic estimate techniques instead. So far, however,...
In this paper, we describe an analysis tool based on the statistical environment R, GridR, which allows using the collection of methodologies available as R packages in a grid environment. The aim of GridR, which was initiated in the context of the EU project Advancing Clinico-Genomics Trials on Cancer (ACGT), is to provide a powerful framework for...
This paper presents the architectural considerations of the Advancing Clinico-Genomic Trials on Cancer (ACGT) project aiming at delivering a European Biomedical Grid in support of efficient knowledge discovery in the context of post-genomic clinical trials on cancer. Our main research challenge in ACGT is the requirement to develop an infrastructur...
This paper describes an approach to detect risks of procurement fraud. It was developed within the context of a European Union project on fraud prevention. Procurement fraud is a special kind of fraud that occurs when employees cheat their own employers by executing or triggering bogus payments. The approach presented here is based on the idea...
With the completion of the human genome and the entrance into the post-genomic era, translational research rises as a major need. In this paper, we present a knowledge discovery workflow (KDw) and its utilization in the context of clinico-genomic trials. KDw aims towards the discovery of 'evidential' correlations between patients' genomic and clini...
Life sciences are currently at the centre of an information revolution. The nature and amount of information now available opens up areas of research that were once in the realm of science fiction. During this information revolution, the data-gathering capabilities have greatly surpassed the data-analysis techniques. Data integration across heterog...
As the internet becomes more pervasive in all areas of human activity, attackers can use the anonymity of cyberspace to commit crimes and compromise the IT infrastructure. As there is currently no generally implemented authentication technology, we have to monitor the contents and relations of messages and internet traffic to detect infringeme...
Recent advances in research methods and technologies have resulted in an explosion of information and knowledge about cancers and their treatment. Knowledge Discovery (KD) is a key technique for dealing with this massive amount of data and the challenges of managing the steadily growing amount of available knowledge. In this paper, we present the A...
Knowledge discovery in clinico-genomic data is a task that requires integrating not only highly heterogeneous kinds of data, but also the requirements and interests of very different user groups. Technologies of grid computing promise to be an effective tool to combine all these requirements into a single architecture. In this paper, we describe s...
Interpretability is an important, yet often neglected criterion when applying machine learning algorithms to real-world tasks. An understandable model enables users to gain more knowledge from their data and to participate in the knowledge discovery process in a more detailed way. Hence, learning interpretable models is a challenging task, whose c...
Probabilistic calibration is the task of producing reliable estimates of the conditional class probability P(class|observation) from the outputs of numerical classifiers. A recent comparative study [1] revealed that Isotonic Regression [2] and Platt Calibration [3] are the most effective probabilistic calibration techniques for a wide range of class...
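The isotonic-regression side of such a comparison rests on the pool-adjacent-violators (PAV) algorithm, which fits a monotone step function mapping classifier scores to probability estimates and can be sketched compactly. The scores and labels below are invented for illustration; in practice the fit would use held-out validation data.

```python
# Isotonic-regression calibration sketch via pool-adjacent-violators (PAV):
# fit the best nondecreasing sequence to the 0/1 labels of examples sorted
# by classifier score. The example scores and labels are illustrative.

def pav(values):
    """Return the nondecreasing least-squares fit to `values` (PAV)."""
    # Each block holds [sum, count]; merge blocks while monotonicity is violated.
    blocks = []
    for v in values:
        blocks.append([v, 1])
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    out = []
    for s, c in blocks:
        out.extend([s / c] * c)   # expand each block to its mean
    return out

# Classifier scores (already sorted ascending) with their true 0/1 labels.
scores = [0.1, 0.2, 0.4, 0.5, 0.7, 0.9]
labels = [0,   1,   0,   1,   1,   1]
calibrated = pav(labels)          # probability estimate per score
print(list(zip(scores, calibrated)))
```

The violating pair (1, 0) at scores 0.2 and 0.4 is pooled to a shared estimate of 0.5, so the calibrated outputs are monotone in the score.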
This paper investigates the use of Design of Experiments in observational studies in order to select informative observations and features for classification. D-optimal plans are searched for in existing data and based on these plans the variables most relevant for classification are determined. The adapted models are then compared with respect to...
We investigate methods to determine appropriate choices of the hyper-parameters for kernel based methods. Support vector classification, kernel logistic regression and support vector regression are considered. Grid search, the Nelder-Mead algorithm and pattern search are used.
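Of the three search strategies named, grid search is the simplest and can be sketched directly. The quadratic toy_loss below is a stand-in for a cross-validated error estimate, and the parameter names C and gamma and the logarithmic grid are illustrative.

```python
# Grid-search sketch for kernel hyper-parameters: evaluate a loss on a
# logarithmic grid of (C, gamma) pairs and keep the best pair.
# `toy_loss` stands in for a cross-validated error estimate.

import math

def toy_loss(C, gamma):
    # Illustrative stand-in for CV error; minimised at C = 10, gamma = 0.1.
    return (math.log10(C) - 1) ** 2 + (math.log10(gamma) + 1) ** 2

def grid_search(loss, C_grid, gamma_grid):
    """Return (best_score, best_C, best_gamma) over the full grid."""
    best = None
    for C in C_grid:
        for g in gamma_grid:
            score = loss(C, g)
            if best is None or score < best[0]:
                best = (score, C, g)
    return best

grid = [10 ** k for k in range(-2, 3)]   # 0.01, 0.1, 1, 10, 100
best_score, best_C, best_gamma = grid_search(toy_loss, grid, grid)
print(best_C, best_gamma)
```

Nelder-Mead or pattern search would replace the exhaustive double loop with an adaptive sequence of evaluations, which matters once each evaluation is a full cross-validation run.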
Next to prediction accuracy, the interpretability of models is one of the fundamental criteria for machine learning algorithms. While high accuracy learners have intensively been explored, interpretability still poses a difficult problem, largely because it can hardly be formalized in a general way. To circumvent this problem, one can often find a mod...
Support Vector Machines (SVMs) have become a popular learning algorithm, in particular for large, high-dimensional classification problems. SVMs have been shown to give the most accurate classification results in a variety of applications. Several methods have been proposed to obtain not only a classification, but also an estimate of the SVM's confidenc...
The analysis of temporal data is an important issue in current research, because most real-world data either explicitly or implicitly contains some information about time. The key to successfully solving temporal learning tasks is to analyze the assumptions that can be made and prior knowledge one has about the temporal process of the learning prob...
Today, most of the data in business applications is stored in relational database systems or in data warehouses built on top of relational database systems. Often, far more data is available than can be processed by standard learning algorithms in reasonable time. This paper presents an extension to kernel algorithms that makes use of the more comp...
Support Vector Machines (SVMs) have become a popular tool for learning with large amounts of high dimensional data. However, it may sometimes be preferable to learn incrementally from previous SVM results, as computing a SVM is very costly in terms of time and memory consumption or because the SVM may be used in an online learning setting. In this...
Today, most of the data in business applications is stored in relational databases. Relational database systems are so popular, because they offer solutions to many problems around data storage, such as efficiency, effectiveness, usability, security and multi-user support. To benefit from these advantages in Support Vector Machine (SVM) learning, w...
A workbench for knowledge acquisition and data analysis is presented and its use for the classification of business cycles is investigated. Inductive Logic Programming (ILP) allows modeling relations between intervals, e.g. time or value intervals. Moreover, the user of the workbench is supported in inspecting the learned rules, not only with respe...
The classification of business cycles is a hard and important problem. Government as well as business decisions rely on the assessment of the current business cycle. In this paper, we investigate how economists can be better supported by a combination of machine learning techniques. We have successfully applied Inductive Logic Programming (ILP). Fo...
Today, most of the data in business applications is stored in relational databases. Relational database systems are so popular, because they offer solutions to many problems around data storage, such as efficiency, effectiveness, usability, security and multi-user support. To benefit from these advantages in Support Vector Machine (SVM) learning, w...
Today, most of the data in business applications is stored in relational databases. Relational database systems are so popular, because they offer solutions to many problems around data storage, such as efficiency, effectiveness, usability, security and multi-user support. To benefit from these advantages in Support Vector Machine (SVM) learning, w...
Support Vector Machines (SVMs) have become a popular tool for learning with large amounts of high dimensional data. However, it may sometimes be preferable to learn incrementally from previous SVM results, as computing a SVM is very costly in terms of time and memory consumption or because the SVM may be used in an online learning setting. In thi...
Time series analysis is an important and complex problem in machine learning and statistics. Real-world applications can consist of very large and high dimensional time series data. Support Vector Machines (SVMs) are a popular tool for the analysis of such data sets. This paper presents some SVM kernel functions and discusses their relative merits,...
Next to prediction accuracy, interpretability is one of the fundamental performance criteria for machine learning. While high accuracy learners have intensively been explored, interpretability still poses a difficult problem. To combine accuracy and interpretability, this paper introduces a framework which combines an approximative model with a...
Local models are high-quality models of small regions of the input space of a learning problem. The advantage of local models is that they are often much more interesting and understandable to the domain expert, as they can concisely describe single aspects of the data instead of describing everything at once as global models do. This results in be...
In the last couple of years, the amount of data to be analyzed in different areas has grown rapidly. Examples range from natural sciences (e.g. astronomy or particle physics), business data (e.g. a large increase in data volume is expected from the use of RFID technology), life sciences (such as high-throughput genomics and post-genomics technologies) or...
Subgroup discovery is the task of finding subgroups of a population which exhibit both distributional unusualness and high generality at the same time. Since the corresponding evaluation functions are not monotonic, the standard pruning techniques from monotonic problems such as frequent set discovery cannot be used. In this paper, we show that opt...
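The pruning idea can be illustrated with a small branch-and-bound sketch. It assumes WRAcc as the quality function and uses the optimistic estimate that a refinement of a subgroup can at best keep only the subgroup's positive records, so oe(S) = (pos_S / N) * (1 - p0) bounds the quality of every refinement; the binary attributes and records are invented for the example.

```python
# Branch-and-bound subgroup discovery sketch with an optimistic estimate:
# a refinement of subgroup S can at best retain only S's positives, giving
# the bound oe(S) = (pos_S / N) * (1 - p0) on the WRAcc of any refinement.
# Data and attribute names are illustrative.

def wracc(records, N, p0):
    if not records:
        return 0.0
    p = sum(r["y"] for r in records) / len(records)
    return (len(records) / N) * (p - p0)

def oe(records, N, p0):
    # Optimistic estimate: best refinement keeps only the positives.
    return (sum(r["y"] for r in records) / N) * (1 - p0)

def dfs(records, attrs, N, p0, best):
    """Depth-first specialization; `best` is a one-element list (mutable)."""
    best[0] = max(best[0], wracc(records, N, p0))
    if oe(records, N, p0) <= best[0]:
        return                      # prune: no refinement can beat `best`
    for i, a in enumerate(attrs):
        dfs([r for r in records if r[a]], attrs[i + 1:], N, p0, best)

data = [
    {"a": 1, "b": 0, "y": 1},
    {"a": 1, "b": 1, "y": 1},
    {"a": 0, "b": 1, "y": 0},
    {"a": 0, "b": 0, "y": 0},
]
p0 = sum(r["y"] for r in data) / len(data)
best = [0.0]
dfs(data, ["a", "b"], len(data), p0, best)
print(best[0])
```

In this tiny run the subgroup a=1 reaches WRAcc 0.25, and both of its refinements are pruned because their optimistic estimates cannot exceed that value — the same mechanism that replaces monotonicity-based pruning at scale.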