Ulf Leser

Humboldt-Universität zu Berlin · Department of Computer Science

Prof. Dr.

About

399
Publications
94,792
Reads
7,773
Citations
Citations since 2017: 4,046 (120 Research Items)
[Chart: research items and citations per year, 2017–2023]
Additional affiliations
March 2014 - August 2014
Free University of Bozen-Bolzano
  • Professor
March 2010 - September 2010
Université Paris-Sud 11
  • Professor
October 2003 - present
Humboldt-Universität zu Berlin
  • Professor (Full)

Publications (399)
Article
Full-text available
Pancreatic neuroendocrine neoplasms (panNENs) are a rare yet diverse type of neoplasia whose precise clinical–pathological classification is frequently challenging. Since incorrect classifications can affect treatment decisions, additional tools which support the diagnosis, such as machine learning (ML) techniques, are critically needed but general...
Preprint
Full-text available
A time series is a sequence of sequentially ordered real values in time. Time series classification (TSC) is the task of assigning a time series to one of a set of predefined classes, usually based on a model learned from examples. Dictionary-based methods for TSC rely on counting the frequency of certain patterns in time series and are important c...
Article
Full-text available
The identification of chemical–protein interactions described in the literature is an important task with applications in drug design, precision medicine and biotechnology. Manual extraction of such relationships from the biomedical literature is costly and often prohibitively time-consuming. The BioCreative VII DrugProt shared task provides a benc...
Preprint
Scientific workflows typically comprise a multitude of different processing steps which often are executed in parallel on different partitions of the input data. These executions, in turn, must be scheduled on the compute nodes of the computational infrastructure at hand. This assignment is complicated by the facts that (a) tasks typically have hig...
Preprint
The study of natural and human-made processes often results in long sequences of temporally-ordered values, aka time series (TS). Such processes often consist of multiple states, e.g. operating modes of a machine, such that state changes in the observed processes result in changes in the distribution or shape of the measured values. Time series seg...
Article
Full-text available
Plain Language Summary: Earthquakes are among the most destructive natural hazards known to humankind. While earthquakes cannot be predicted, it is possible to record them in real-time and provide warnings to locations that the shaking has not yet reached. Warning times usually range from a few seconds to tens of seconds. For very large earthquakes,...
Article
Full-text available
High-throughput technologies led to the generation of a wealth of data on regulatory DNA elements in the human genome. However, results from disease-driven studies are primarily shared in textual form as scientific articles. Information extraction (IE) algorithms allow this information to be (semi-)automatically accessed. Their development, however...
Preprint
A motif intuitively is a short time series that repeats itself approximately the same within a larger time series. Such motifs often represent concealed structures, such as heart beats in an ECG recording, or sleep spindles in EEG sleep data. Motif discovery (MD) is the task of finding such motifs in a given input series. As there are varying defin...
Article
Machine learning (ML) approaches have demonstrated the ability to predict molecular spectra at a fraction of the computational cost of traditional theoretical chemistry methods while maintaining high accuracy. Graph neural networks (GNNs) are particularly promising in this regard, but different types of GNNs have not yet been systematically compare...
Preprint
Many scientific workflow scheduling algorithms need to be informed about task runtimes a priori to conduct efficient scheduling. In heterogeneous cluster infrastructures, this problem becomes aggravated because these runtimes are required for each task-node pair. Using historical data is often not feasible as logs are typically not retained indefin...
Preprint
Full-text available
Ruptures of the largest earthquakes can last between a few seconds and several minutes. An early assessment of the final earthquake size is essential for early warning systems. However, it is still unclear when in the rupture history this final size can be predicted. Here we introduce a probabilistic view of rupture evolution - how likely is the ev...
Article
Full-text available
Background: Pancreatic neuroendocrine neoplasms (PanNENs) fall into two subclasses: the well-differentiated, low- to high-grade pancreatic neuroendocrine tumors (PanNETs), and the poorly-differentiated, high-grade pancreatic neuroendocrine carcinomas (PanNECs). While recent studies suggest an endocrine descent of PanNETs, the origin of PanNECs remai...
Article
Full-text available
Tables are a common way to present information in an intuitive and concise manner. They are used extensively in media such as scientific articles or web pages. Automatically analyzing the content of tables bears special challenges. One of the most basic tasks is determination of the orientation of a table: In column tables, columns represent one en...
Article
Early time series classification (eTSC) is the problem of classifying a time series after as few measurements as possible with the highest possible accuracy. The most critical issue of any eTSC method is to decide when enough data of a time series has been seen to take a decision: Waiting for more data points usually makes the classification proble...
Article
Today’s scientific data analysis very often requires complex Data Analysis Workflows (DAWs) executed over distributed computational infrastructures, e.g., clusters. Much research effort is devoted to the tuning and performance optimization of specific workflows for specific clusters. However, an arguably even more important problem for accelerating...
Conference Paper
The detection of chemical-protein interactions is an important task with applications in drug design and biotechnology. The BioCreative VII - DrugProt shared task provides a benchmark for the automated extraction of such relations from scientific text. This article describes the Humboldt approach to solving it. We define the task as a relation clas...
Conference Paper
Full-text available
Modern Earth Observation (EO) often analyses hundreds of gigabytes of data from thousands of satellite images. This data usually is processed with hand-made scripts combining several tools implementing the various steps within such an analysis. A fair amount of geographers' work goes into optimization, tuning, and parallelization in such a setting....
Preprint
Full-text available
The landscape of workflow systems for scientific applications is notoriously convoluted with hundreds of seemingly equivalent workflow systems, many isolated research claims, and a steep learning curve. To address some of these challenges and lay the groundwork for transforming workflows research and development, the WorkflowsRI and ExaWorks projec...
Article
Full-text available
Background: The clinical management of high-grade gastroenteropancreatic neuroendocrine neoplasms (GEP-NEN) is challenging due to disease heterogeneity, illustrating the need for reliable biomarkers facilitating patient stratification and guiding treatment decisions. FMS-like tyrosine kinase 3 ligand (Flt3L) is emerging as a prognostic or predicti...
Preprint
Full-text available
Scientific workflows are a cornerstone of modern scientific computing, and they have underpinned some of the most significant discoveries of the last decade. Many of these workflows have high computational, storage, and/or communication demands, and thus must execute on a wide range of large-scale platforms, from large clouds to upcoming exascale H...
Article
Full-text available
High-throughput technologies have led to a continuously growing amount of information about regulatory features in the genome. A wealth of data generated by large international research consortia is available from online databases. Disease-driven studies provide details on specific DNA elements or epigenetic modifications regulating gene expression...
Article
Full-text available
Objective: We present the Berlin-Tübingen-Oncology corpus (BRONCO), a large and freely available corpus of shuffled sentences from German oncological discharge summaries annotated with diagnosis, treatments, medications, and further attributes including negation and speculation. The aim of BRONCO is to foster reproducible and openly available resear...
Article
Precise real time estimates of earthquake magnitude and location are essential for early warning and rapid response. While recently multiple deep learning approaches for fast assessment of earthquakes have been proposed, they usually rely on either seismic records from a single station or from a fixed set of seismic stations. Here we introduce a ne...
Article
Full-text available
Named entity recognition (NER) is an important step in biomedical information extraction pipelines. Tools for NER should be easy to use, cover multiple entity types, be highly accurate, and be robust towards variations in text genre and style. We present HunFlair, an NER tagger fulfilling these requirements. HunFlair is integrated into the widely-u...
Preprint
Full-text available
Precise real time estimates of earthquake magnitude and location are essential for early warning and rapid response. While recently multiple deep learning approaches for fast assessment of earthquakes have been proposed, they usually rely on either seismic records from a single station or from a fixed set of seismic stations. Here we introduce a ne...
Article
Earthquakes are major hazards to humans, buildings and infrastructure. Early warning methods aim to provide advance notice of incoming strong shaking to enable preventive action and mitigate seismic risk. Their usefulness depends on accuracy, the relation between true, missed and false alerts, and timeliness, the time between a warning and the arri...
Article
Full-text available
Tables are a common way to present information in an intuitive and concise manner. They are used extensively in media such as scientific articles or web pages. Automatically analyzing the content of tables bears special challenges. One of the most basic tasks is determination of the orientation of a table: In column tables, columns represent one en...
Preprint
Full-text available
Earthquake early warning aims to provide advance notice of incoming strong shaking to enable preventive action and mitigate seismic hazard. Its usefulness depends on accuracy, the relation between true, missed and false alerts, and timeliness, the time between a warning and the arrival of strong shaking. Here we propose a novel early warning method...
Article
Full-text available
In this paper, we investigate the computation of alternative paths between two locations in a road network. More specifically, we study the k-shortest paths with limited overlap (kSPwLO) problem that aims at finding a set of k paths such that all paths are sufficiently dissimilar to each other and as short as possible. To compute kSPwLO...
Article
Full-text available
Early time series classification (eTSC) is the problem of classifying a time series after as few measurements as possible with the highest possible accuracy. The most critical issue of any eTSC method is to decide when enough data of a time series has been seen to take a decision: Waiting for more data points usually makes the classification proble...
Preprint
Full-text available
Tables are a popular and efficient means of presenting structured information. They are used extensively in various kinds of documents including web pages. Tables display information as a two-dimensional matrix, the semantics of which is conveyed by a mixture of structure (rows, columns), headers, caption, and content. Recent research has started t...
Preprint
Full-text available
Named Entity Recognition (NER) is an important step in biomedical information extraction pipelines. Tools for NER should be easy to use, cover multiple entity types, be highly accurate, and be robust towards variations in text genre and style. To this end, we propose HunFlair, an NER tagger covering multiple entity types integrated into the widely used N...
Article
Motivation: The automatic extraction of published relationships between molecular entities has important applications in many biomedical fields, ranging from Systems Biology to Personalized Medicine. Existing works focused on extracting relationships described in single articles or in single sentences. However, a single record is rarely sufficient...
Article
Full-text available
Lesion-based targeting strategies underlie cancer precision medicine. However, biological principles – such as cellular senescence – remain difficult to implement in molecularly informed treatment decisions. Functional analyses in syngeneic mouse models and cross-species validation in patient datasets might uncover clinically relevant genetics of b...
Article
Full-text available
Motivation: A significant portion of molecular biology investigates signalling pathways and thus depends on an up-to-date and complete resource of functional protein-protein associations (PPAs) that constitute such pathways. Despite extensive curation efforts, major pathway databases are still notoriously incomplete. Relation extraction can help t...
Preprint
Full-text available
Pancreatic Neuroendocrine Neoplasms (PanNENs) comprise a rare and heterogeneous group of tumors derived from neuroendocrine cells of the pancreas. Despite genetic and epigenetic characterization, biomarkers for improved patient stratification and personalized therapy are sparse and targeted therapies, including the mTOR inhibitor Everolimus, have s...
Preprint
Full-text available
The analysis of next-generation sequencing (NGS) data requires complex computational workflows consisting of dozens of autonomously developed yet interdependent processing steps. Whenever large amounts of data need to be processed, these workflows must be executed on a parallel and/or distributed systems to ensure reasonable runtime. Porting a work...
Article
Patents are an important source of information in industry and academia. However, quickly grasping the essence of a given patent is difficult as they typically are very long and written in a rather inaccessible style. This essential information, especially the invention itself and the experimental part of the invention, is usually contained in th...
Article
Full-text available
Magnitude estimation is a central task in seismology needed for a wide spectrum of applications ranging from seismicity analysis to rapid assessment of earthquakes. However, magnitude estimates at individual stations show significant variability, mostly due to propagation effects, radiation pattern and ambient noise. To obtain reliable and precise...
Article
Full-text available
Background: Diagnosis and treatment decisions in cancer increasingly depend on a detailed analysis of the mutational status of a patient's genome. This analysis relies on previously published information regarding the association of variations to disease progression and possible interventions. Clinicians to a large degree use biomedical search eng...
Preprint
Full-text available
Early time series classification (eTSC) is the problem of classifying a time series after as few measurements as possible with the highest possible accuracy. The most critical issue of any eTSC method is to decide when enough data of a time series has been seen to take a decision: Waiting for more data points usually makes the classification proble...
Conference Paper
Full-text available
In this paper we present our contribution to the CLEF eHealth challenge 2019, Task 1. The task involves the automatic annotation of German non-technical summaries of animal experiments with ICD-10 codes. We approach the task as a multi-label classification problem and leverage the multilingual version of the BERT text encoding model [6] to represent...
Conference Paper
Full-text available
We present the BB-Tree, a fast and space-efficient index structure for processing multidimensional read/write workloads in main memory. The BB-Tree uses a k-ary search tree for pruning and searching while keeping all data in leaf nodes. It linearizes the inner search tree and manages it in a cache-optimized array, creating the need for occasional...
Conference Paper
Full-text available
A scientific workflow is a set of interdependent compute tasks orchestrating large scale data analyses or in-silico experiments. Workflows often comprise thousands of tasks with heterogeneous resource requirements that need to be executed on distributed resources. Many workflow engines solve parallelization by submitting tasks to a batch schedulin...
Conference Paper
Full-text available
The Earth's surface is continuously observed by satellites, leading to large multi-spectral image data sets of increasing spatial resolution and temporal density. One important application of satellite data is the mapping of land cover and land use changes such as urbanization, deforestation, and desertification. This information should be obtained...
Conference Paper
Full-text available
In scientific computing, scheduling tasks with heterogeneous resource requirements still requires user estimates, which tend to be inaccurate in spite of laborious manual processes used to derive them. In this paper, we show that machine learning outperforms user estimates and models trained at runtime can be used to improve the resource allocation...
Conference Paper
Full-text available
We present the BB-Tree, a fast and space-efficient index structure for processing multidimensional workloads in main memory. It uses a k-ary search tree for pruning and searching while keeping all data in leaf nodes. It linearizes the inner search tree and manages it in a cache-optimized array, with occasional reorganizations when data changes. To...
Article
Full-text available
Detection of epithelial ovarian cancer (EOC) poses a critical medical challenge. However, novel biomarkers for diagnosis remain to be discovered. Therefore, innovative approaches are of the utmost importance for patient outcome. Here, we present a concept for blood-based biomarker discovery, investigating both epithelial and specifically stromal co...
Article
Full-text available
Numerous methods have been developed trying to infer actual regulatory events in a sample. A prominent class of methods model genome-wide gene expression as linear equations derived from a transcription factor (TF)–gene network and optimizes parameters to fit the measured expression intensities. We apply four such methods on experiments with a TF...
Article
Motivation: Several recent studies showed that the application of deep neural networks advanced the state-of-the-art in named entity recognition (NER), including biomedical NER. However, the impact on performance and the robustness of improvements crucially depend on the availability of sufficiently large training corpora, which is a problem in t...
Preprint
Full-text available
Rule-based models are attractive for various tasks because they inherently lead to interpretable and explainable decisions and can easily incorporate prior knowledge. However, such systems are difficult to apply to problems involving natural language, due to its linguistic variability. In contrast, neural models can cope very well with ambiguity by...
Article
Purpose: Precision oncology depends on the availability of up-to-date, comprehensive, and accurate information about associations between genetic variants and therapeutic options. Recently, a number of knowledge bases (KBs) have been developed that gather such information on the basis of expert curation of the scientific literature. We performed a q...
Article
In many domains, the previous decade was characterized by increasing data volumes and growing complexity of data analyses, creating new demands for batch processing on distributed systems. Effective operation of these systems is challenging when facing uncertainties about the performance of jobs and tasks under varying resource configurations, e.g....
Article
Full-text available
Cancer cell lines (CCL) are an integral part of modern cancer research but are susceptible to misidentification. The increasing popularity of sequencing technologies motivates the in-silico identification of CCLs based on their mutational fingerprint, but care must be taken when identifying heterogeneous data. We recently developed the proof-of-con...