About
Publications: 399
Reads: 94,792
Citations: 7,773 (since 2017)
Introduction
Knowledge Management in Bioinformatics
Additional affiliations
March 2014 - August 2014
March 2010 - September 2010
October 2003 - present
Publications (399)
Pancreatic neuroendocrine neoplasms (panNENs) are a rare yet diverse type of neoplasia whose precise clinical–pathological classification is frequently challenging. Since incorrect classifications can affect treatment decisions, additional tools which support the diagnosis, such as machine learning (ML) techniques, are critically needed but general...
A time series is a sequence of sequentially ordered real values in time. Time series classification (TSC) is the task of assigning a time series to one of a set of predefined classes, usually based on a model learned from examples. Dictionary-based methods for TSC rely on counting the frequency of certain patterns in time series and are important c...
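The dictionary-based idea mentioned above can be illustrated with a minimal sketch: each series is summarized by counts of discretized sliding-window "words", and a new series gets the label of the training series with the most similar word histogram. This is a toy illustration, not the method of any specific publication; the discretization and names are made up.

```python
# Minimal bag-of-patterns sketch for dictionary-based time series
# classification. The threshold discretization is a crude stand-in for
# SAX-style symbolic approximation.
from collections import Counter

def to_word(window, bins=(-0.5, 0.5)):
    # Map each value to a symbol 'a'/'b'/'c' by threshold bins.
    return "".join("abc"[sum(v > b for b in bins)] for v in window)

def bag_of_patterns(series, w=3):
    # Count all length-w words over sliding windows.
    return Counter(to_word(series[i:i + w]) for i in range(len(series) - w + 1))

def classify(train, query, w=3):
    # train: list of (series, label); pick the label whose word histogram
    # has the largest intersection count with the query's histogram.
    q = bag_of_patterns(query, w)
    return max(train, key=lambda item: sum((bag_of_patterns(item[0], w) & q).values()))[1]

train = [([0.0, 1.0, 0.0, 1.0, 0.0, 1.0], "square"),
         ([0.0, 0.1, 0.2, 0.3, 0.4, 0.5], "trend")]
print(classify(train, [1.0, 0.0, 1.0, 0.0, 1.0, 0.0]))  # → square
```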
The identification of chemical–protein interactions described in the literature is an important task with applications in drug design, precision medicine and biotechnology. Manual extraction of such relationships from the biomedical literature is costly and often prohibitively time-consuming. The BioCreative VII DrugProt shared task provides a benc...
Scientific workflows typically comprise a multitude of different processing steps which often are executed in parallel on different partitions of the input data. These executions, in turn, must be scheduled on the compute nodes of the computational infrastructure at hand. This assignment is complicated by the facts that (a) tasks typically have hig...
The study of natural and human-made processes often results in long sequences of temporally-ordered values, aka time series (TS). Such processes often consist of multiple states, e.g. operating modes of a machine, such that state changes in the observed processes result in changes in the distribution or shape of the measured values. Time series seg...
Plain Language Summary
Earthquakes are among the most destructive natural hazards known to humankind. While earthquakes cannot be predicted, it is possible to record them in real‐time and provide warnings to locations the shaking has not yet reached. Warning times usually range from a few seconds to tens of seconds. For very large earthquakes,...
High-throughput technologies led to the generation of a wealth of data on regulatory DNA elements in the human genome. However, results from disease-driven studies are primarily shared in textual form as scientific articles. Information extraction (IE) algorithms allow this information to be (semi-)automatically accessed. Their development, however...
A motif intuitively is a short time series that repeats itself approximately the same within a larger time series. Such motifs often represent concealed structures, such as heart beats in an ECG recording, or sleep spindles in EEG sleep data. Motif discovery (MD) is the task of finding such motifs in a given input series. As there are varying defin...
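One common formalization of motif discovery is finding the pair of non-overlapping subsequences with the smallest distance. A brute-force sketch, assuming plain Euclidean distance (real methods use far faster schemes and often z-normalization):

```python
# Brute-force motif-pair discovery: find the two non-overlapping
# length-m subsequences with the smallest Euclidean distance.
# Quadratic in the series length; for illustration only.
import math

def motif_pair(ts, m):
    n = len(ts) - m + 1
    best = (math.inf, None, None)
    for i in range(n):
        for j in range(i + m, n):  # enforce non-overlap to avoid trivial matches
            d = math.dist(ts[i:i + m], ts[j:j + m])
            if d < best[0]:
                best = (d, i, j)
    return best

ts = [0, 0, 5, 1, 5, 0, 0, 0, 5, 1, 5, 0]
d, i, j = motif_pair(ts, 4)
print(d, i, j)  # → 0.0 0 6  (the repeated pattern [0, 0, 5, 1])
```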
Machine learning (ML) approaches have demonstrated the ability to predict molecular spectra at a fraction of the computational cost of traditional theoretical chemistry methods while maintaining high accuracy. Graph neural networks (GNNs) are particularly promising in this regard, but different types of GNNs have not yet been systematically compare...
Many scientific workflow scheduling algorithms need to be informed about task runtimes a priori to conduct efficient scheduling. In heterogeneous cluster infrastructures, this problem becomes aggravated because these runtimes are required for each task-node pair. Using historical data is often not feasible as logs are typically not retained indefin...
Ruptures of the largest earthquakes can last between a few seconds and several minutes. An early assessment of the final earthquake size is essential for early warning systems. However, it is still unclear when in the rupture history this final size can be predicted. Here we introduce a probabilistic view of rupture evolution - how likely is the ev...
Background
Pancreatic neuroendocrine neoplasms (PanNENs) fall into two subclasses: the well-differentiated, low- to high-grade pancreatic neuroendocrine tumors (PanNETs), and the poorly-differentiated, high-grade pancreatic neuroendocrine carcinomas (PanNECs). While recent studies suggest an endocrine descent of PanNETs, the origin of PanNECs remai...
Tables are a common way to present information in an intuitive and concise manner. They are used extensively in media such as scientific articles or web pages. Automatically analyzing the content of tables bears special challenges. One of the most basic tasks is determination of the orientation of a table: In column tables, columns represent one en...
Early time series classification (eTSC) is the problem of classifying a time series after as few measurements as possible with the highest possible accuracy. The most critical issue of any eTSC method is to decide when enough data of a time series has been seen to take a decision: Waiting for more data points usually makes the classification proble...
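The stopping problem described in this abstract can be sketched with a simple confidence-threshold rule: classify each growing prefix and stop as soon as confidence is high enough. The classifier below is a hypothetical toy, not any published eTSC method.

```python
# Sketch of the core eTSC decision rule: classify growing prefixes and
# stop as soon as the classifier's confidence exceeds a threshold.

def classify_prefix(prefix):
    # Toy classifier: label by the sign of the running mean, with
    # confidence growing with prefix length.
    mean = sum(prefix) / len(prefix)
    label = "up" if mean >= 0 else "down"
    conf = min(1.0, abs(mean) * len(prefix) / 10)
    return label, conf

def early_classify(stream, threshold=0.8):
    prefix = []
    for value in stream:
        prefix.append(value)
        label, conf = classify_prefix(prefix)
        if conf >= threshold:
            return label, len(prefix)  # decision and earliness
    return label, len(prefix)          # forced decision at the end

label, used = early_classify([1.0, 1.2, 0.9, 1.1, 1.0, 0.8], threshold=0.5)
print(label, used)  # → up 5  (decision after 5 of 6 measurements)
```

The trade-off in the abstract is visible here: a lower threshold yields earlier but less reliable decisions.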
Today’s scientific data analysis very often requires complex Data Analysis Workflows (DAWs) executed over distributed computational infrastructures, e.g., clusters. Much research effort is devoted to the tuning and performance optimization of specific workflows for specific clusters. However, an arguably even more important problem for accelerating...
The detection of chemical-protein interactions is an important task with applications in drug design and biotechnology. The BioCreative VII - DrugProt shared task provides a benchmark for the automated extraction of such relations from scientific text. This article describes the Humboldt approach to solving it. We define the task as a relation clas...
Modern Earth Observation (EO) often analyses hundreds of gigabytes of data from thousands of satellite images. This data usually is processed with hand-made scripts combining several tools implementing the various steps within such an analysis. A fair amount of geographers' work goes into optimization, tuning, and parallelization in such a setting....
The landscape of workflow systems for scientific applications is notoriously convoluted with hundreds of seemingly equivalent workflow systems, many isolated research claims, and a steep learning curve. To address some of these challenges and lay the groundwork for transforming workflows research and development, the WorkflowsRI and ExaWorks projec...
Background:
The clinical management of high-grade gastroenteropancreatic neuroendocrine neoplasms (GEP-NEN) is challenging due to disease heterogeneity, illustrating the need for reliable biomarkers facilitating patient stratification and guiding treatment decisions. FMS-like tyrosine kinase 3 ligand (Flt3L) is emerging as a prognostic or predicti...
Scientific workflows are a cornerstone of modern scientific computing, and they have underpinned some of the most significant discoveries of the last decade. Many of these workflows have high computational, storage, and/or communication demands, and thus must execute on a wide range of large-scale platforms, from large clouds to upcoming exascale H...
High-throughput technologies have led to a continuously growing amount of information about regulatory features in the genome. A wealth of data generated by large international research consortia is available from online databases. Disease-driven studies provide details on specific DNA elements or epigenetic modifications regulating gene expression...
Objective
We present the Berlin-Tübingen-Oncology corpus (BRONCO), a large and freely available corpus of shuffled sentences from German oncological discharge summaries annotated with diagnosis, treatments, medications, and further attributes including negation and speculation. The aim of BRONCO is to foster reproducible and openly available resear...
Precise real time estimates of earthquake magnitude and location are essential for early warning and rapid response. While recently multiple deep learning approaches for fast assessment of earthquakes have been proposed, they usually rely on either seismic records from a single station or from a fixed set of seismic stations. Here we introduce a ne...
Named entity recognition (NER) is an important step in biomedical information extraction pipelines. Tools for NER should be easy to use, cover multiple entity types, be highly accurate, and be robust towards variations in text genre and style. We present HunFlair, an NER tagger fulfilling these requirements. HunFlair is integrated into the widely-u...
Earthquakes are major hazards to humans, buildings and infrastructure. Early warning methods aim to provide advance notice of incoming strong shaking to enable preventive action and mitigate seismic risk. Their usefulness depends on accuracy, the relation between true, missed and false alerts, and timeliness, the time between a warning and the arri...
Earthquake early warning aims to provide advance notice of incoming strong shaking to enable preventive action and mitigate seismic hazard. Its usefulness depends on accuracy, the relation between true, missed and false alerts, and timeliness, the time between a warning and the arrival of strong shaking. Here we propose a novel early warning method...
In this paper, we investigate the computation of alternative paths between two locations in a road network. More specifically, we study the k-shortest paths with limited overlap (kSPwLO) problem that aims at finding a set of k paths such that all paths are sufficiently dissimilar to each other and as short as possible. To compute kSPwLO...
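The dissimilarity constraint in such alternative-path problems is typically an overlap ratio: a candidate path is acceptable only if the length of edges it shares with every already-chosen path stays below a threshold. A minimal sketch, with made-up edge lengths:

```python
# Overlap of path p with path q: shared-edge length divided by the
# total length of p. Paths are node sequences; `length` maps edges
# to (hypothetical) lengths.

def overlap(p, q, length):
    shared = set(zip(p, p[1:])) & set(zip(q, q[1:]))
    return sum(length[e] for e in shared) / sum(length[e] for e in zip(p, p[1:]))

length = {("a", "b"): 2, ("b", "c"): 2, ("a", "d"): 3, ("d", "c"): 3}
p1 = ["a", "b", "c"]  # shortest path
p2 = ["a", "d", "c"]  # fully disjoint alternative
print(overlap(p2, p1, length))  # → 0.0
```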
Tables are a popular and efficient means of presenting structured information. They are used extensively in various kinds of documents including web pages. Tables display information as a two-dimensional matrix, the semantics of which is conveyed by a mixture of structure (rows, columns), headers, caption, and content. Recent research has started t...
Named Entity Recognition (NER) is an important step in biomedical information extraction pipelines. Tools for NER should be easy to use, cover multiple entity types, be highly accurate, and be robust towards variations in text genre and style. To this end, we propose HunFlair, an NER tagger covering multiple entity types integrated into the widely used N...
Motivation:
The automatic extraction of published relationships between molecular entities has important applications in many biomedical fields, ranging from Systems Biology to Personalized Medicine. Existing work has focused on extracting relationships described in single articles or in single sentences. However, a single record is rarely sufficient...
Lesion-based targeting strategies underlie cancer precision medicine. However, biological principles – such as cellular senescence – remain difficult to implement in molecularly informed treatment decisions. Functional analyses in syngeneic mouse models and cross-species validation in patient datasets might uncover clinically relevant genetics of b...
Motivation:
A significant portion of molecular biology investigates signalling pathways and thus depends on an up-to-date and complete resource of functional protein-protein associations (PPAs) that constitute such pathways. Despite extensive curation efforts, major pathway databases are still notoriously incomplete. Relation extraction can help t...
Pancreatic Neuroendocrine Neoplasms (PanNENs) comprise a rare and heterogeneous group of tumors derived from neuroendocrine cells of the pancreas. Despite genetic and epigenetic characterization, biomarkers for improved patient stratification and personalized therapy are sparse and targeted therapies, including the mTOR inhibitor Everolimus, have s...
The analysis of next-generation sequencing (NGS) data requires complex computational workflows consisting of dozens of autonomously developed yet interdependent processing steps. Whenever large amounts of data need to be processed, these workflows must be executed on a parallel and/or distributed systems to ensure reasonable runtime. Porting a work...
Patents are an important source of information in industry and academia. However, quickly grasping the essence of a given patent is difficult as they typically are very long and written in a rather inaccessible style. This essential information, especially the invention itself and the experimental part of the invention, is usually contained in th...
Magnitude estimation is a central task in seismology needed for a wide spectrum of applications ranging from seismicity analysis to rapid assessment of earthquakes. However, magnitude estimates at individual stations show significant variability, mostly due to propagation effects, radiation pattern and ambient noise. To obtain reliable and precise...
Background:
Diagnosis and treatment decisions in cancer increasingly depend on a detailed analysis of the mutational status of a patient's genome. This analysis relies on previously published information regarding the association of variations to disease progression and possible interventions. Clinicians to a large degree use biomedical search eng...
In this paper we present our contribution to the CLEF eHealth challenge 2019, Task 1. The task involves the automatic annotation of German non-technical summaries of animal experiments with ICD-10 codes. We approach the task as a multi-label classification problem and leverage the multilingual version of the BERT text encoding model [6] to represent...
We present the BB-Tree, a fast and space-efficient index structure for processing multidimensional read/write workloads in main memory. The BB-Tree uses a k-ary search tree for pruning and searching while keeping all data in leaf nodes. It linearizes the inner search tree and manages it in a cache-optimized array, creating the need for occasional...
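The linearized k-ary search idea can be sketched as follows: each inner node holds k-1 sorted delimiter keys in one contiguous array slice, and search descends to one of k children computed by offset arithmetic. The layout below is a hypothetical simplification, not the BB-Tree's actual memory layout.

```python
# Minimal sketch of k-ary search over a linearized inner tree: one node
# occupies K-1 consecutive slots of the delimiter array, and the child
# to descend into is found by binary search within that slice.
import bisect

K = 4  # fanout: K-1 delimiters per node

def child_index(node, key, tree):
    # Return which of the K children of `node` the key falls into.
    delimiters = tree[node * (K - 1):(node + 1) * (K - 1)]
    return bisect.bisect_right(delimiters, key)

# One node with delimiters 10, 20, 30 splits the key space into 4 ranges.
tree = [10, 20, 30]
print(child_index(0, 5, tree))   # → 0
print(child_index(0, 25, tree))  # → 2
```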
A scientific workflow is a set of interdependent compute tasks orchestrating large scale data analyses or in-silico experiments. Workflows often comprise thousands of tasks with heterogeneous resource requirements that need to be executed on distributed resources. Many workflow engines solve parallelization by submitting tasks to a batch schedulin...
The Earth's surface is continuously observed by satellites, leading to large multi-spectral image data sets of increasing spatial resolution and temporal density. One important application of satellite data is the mapping of land cover and land use changes such as urbanization, deforestation, and desertification. This information should be obtained...
In scientific computing, scheduling tasks with heterogeneous resource requirements still requires user estimates, which tend to be inaccurate in spite of laborious manual processes used to derive them. In this paper, we show that machine learning outperforms user estimates and models trained at runtime can be used to improve the resource allocation...
We present the BB-Tree, a fast and space-efficient index structure for processing multidimensional workloads in main memory. It uses a k-ary search tree for pruning and searching while keeping all data in leaf nodes. It linearizes the inner search tree and manages it in a cache-optimized array, with occasional reorganizations when data changes. To...
Detection of epithelial ovarian cancer (EOC) poses a critical medical challenge. However, novel biomarkers for diagnosis remain to be discovered. Therefore, innovative approaches are of the utmost importance for patient outcome. Here, we present a concept for blood-based biomarker discovery, investigating both epithelial and specifically stromal co...
Numerous methods have been developed trying to infer actual regulatory events in a sample. A prominent class of methods models genome-wide gene expression as linear equations derived from a transcription factor (TF)-gene network and optimizes parameters to fit the measured expression intensities. We apply four such methods on experiments with a TF...
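The linear-model idea can be made concrete with a small sketch: expression of each gene is modeled as a weighted combination of TF activities given a known TF-gene network, and the activities are fit by least squares. All numbers here are synthetic, for illustration only.

```python
# Fit TF activities that best explain observed expression under a
# linear model expression ≈ network @ activity.
import numpy as np

# network[i, j] = 1 if TF j regulates gene i (4 genes, 2 TFs)
network = np.array([[1, 0],
                    [1, 1],
                    [0, 1],
                    [1, 0]], dtype=float)
true_activity = np.array([2.0, -1.0])
expression = network @ true_activity  # synthetic measurements

activity, *_ = np.linalg.lstsq(network, expression, rcond=None)
print(np.round(activity, 6))  # → [ 2. -1.]
```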
Motivation:
Several recent studies showed that the application of deep neural networks advanced the state-of-the-art in named entity recognition (NER), including biomedical NER. However, the impact on performance and the robustness of improvements crucially depends on the availability of sufficiently large training corpora, which is a problem in t...
Rule-based models are attractive for various tasks because they inherently lead to interpretable and explainable decisions and can easily incorporate prior knowledge. However, such systems are difficult to apply to problems involving natural language, due to its linguistic variability. In contrast, neural models can cope very well with ambiguity by...
PURPOSE
Precision oncology depends on the availability of up-to-date, comprehensive, and accurate information about associations between genetic variants and therapeutic options. Recently, a number of knowledge bases (KBs) have been developed that gather such information on the basis of expert curation of the scientific literature. We performed a q...
In many domains, the previous decade was characterized by increasing data volumes and growing complexity of data analyses, creating new demands for batch processing on distributed systems. Effective operation of these systems is challenging when facing uncertainties about the performance of jobs and tasks under varying resource configurations, e.g....
Cancer cell lines (CCL) are an integral part of modern cancer research but are susceptible to misidentification. The increasing popularity of sequencing technologies motivates the in-silico identification of CCLs based on their mutational fingerprint, but care must be taken when identifying heterogeneous data. We recently developed the proof-of-con...