Johannes Köster
University of Duisburg-Essen | uni-due · Institute of Human Genetics
Dr. rer. nat.
About
Publications: 83
Reads: 26,529
Citations: 8,635
Introduction
I am a computer scientist with a focus on algorithm engineering and data analysis in bioinformatics. Currently, I work as a Postdoctoral Research Fellow in the groups of Shirley Liu (Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Harvard School of Public Health) and Myles Brown (Division of Molecular and Cellular Oncology, Department of Medical Oncology, Dana-Farber Cancer Institute).
Additional affiliations
Education
January 2011 - March 2015
October 2005 - September 2010
Publications (83)
Multiplexed error-robust fluorescence in-situ hybridization (MERFISH) is a new technology to obtain spatially resolved gene or transcript expression profiles in single cells for hundreds to thousands of genes in parallel. We present a Bayesian model for expression analysis on MERFISH data. We show that the model successfully rescues systematic bias...
Neuroblastoma is a malignancy of the developing sympathetic nervous system that is often lethal when relapse occurs. We here used whole-exome sequencing, mRNA expression profiling, array CGH and DNA methylation analysis to characterize 16 paired samples at diagnosis and relapse from individuals with neuroblastoma. The mutational burden significantl...
The analysis of next-generation sequencing (NGS) data is a major topic in bioinformatics: short reads obtained from DNA, the molecule encoding the genome of living organisms, are processed to provide insight into biological or medical questions. This thesis provides novel solutions to major topics within the analysis of NGS data, focusing on parall...
Snakemake is a workflow engine that provides a readable Python-based workflow definition language and a powerful execution environment that scales from single-core workstations to compute clusters without modifying the workflow. It is the first system to support the use of automatically inferred multiple named wildcards (or variables) in input and...
Background: Identification of human leukocyte antigen (HLA) types from DNA-sequenced human samples is important in organ transplantation and cancer immunotherapy and remains a challenging task considering sequence homology and extreme polymorphism of HLA genes. Results: We present Orthanq, a novel statistical model and corresponding application for...
Background: At a global scale, the SARS-CoV-2 virus did not remain in its initial genotype for a long period of time, with the first global reports of variants of concern (VOCs) in late 2020. Subsequently, genome sequencing has become an indispensable tool for characterizing the ongoing pandemic, particularly for typing SARS-CoV-2 samples obtained f...
Antibodies targeting the immune checkpoint molecules PD-1, PD-L1 and CTLA-4, administered alone or in combination with chemotherapy, are the standard of care in most patients with metastatic non-small-cell lung cancers. When given before curative surgery, tumor responses and improved event-free survival are achieved. New antibody combinations may b...
Background: Identification of human leukocyte antigen (HLA) types from DNA-sequenced human samples is important in organ transplantation and cancer immunotherapy and remains a challenging task considering sequence homology and extreme polymorphism of HLA genes. Results: We present Orthanq, a novel statistical model and corresponding application for t...
Background: Pancreatic ductal adenocarcinoma (PDAC) is an aggressive cancer with poor prognosis. It is marked by extraordinary resistance to conventional therapies including chemotherapy and radiation, as well as to essentially all targeted therapies evaluated so far. More than 90% of PDAC cases harbor an activating KRAS mutation. As the most common...
The past decade has seen unprecedented progress in the survival chances of cancer patients as a consequence of new treatments targeting tumor-specific cellular processes, which have been uncovered by molecular genetic analyses. From a data analysis perspective, the main challenge is the high dimensionality and multimodality of the genetic data rela...
Proteins have manifold functions in living cells, including structural integrity, transport, defense against pathogens, or message transmission, to name but a few. Recent advances in Machine Learning appear to have solved the protein folding problem, i.e., how to obtain the three-dimensional functional protein structure from the amino acid sequence...
We present vembrane as a command line VCF/BCF filtering tool that consolidates and extends the filtering functionality of previous software to meet any imaginable filtering use case. Vembrane exposes the VCF/BCF file type specification and its unofficial extensions by the annotation tools VEP and SnpEff as Python data structures. vembrane filter en...
Data from sequencing of DNA or RNA samples is routinely scanned for variation. Such variation data is stored in the standardized VCF/BCF format with additional annotations. Analyses of variants usually involve steps where filters are applied to narrow down the list of candidates for further analysis. A number of tools for this task exist, differing...
Occurrence of extra-chromosomal circular DNA is a phenomenon frequently observed in tumor cells, and the presence of such DNA has been recognized as a marker of adverse outcome across cancer types. We here describe a computational workflow for identification of DNA circles from long-read sequencing data. The workflow is implemented based on the Sna...
Pancreatic ductal adenocarcinoma (PDAC) is an aggressive cancer with poor prognosis. Drug resistance is the major cause for therapeutic failure in PDAC patients with progressive disease. The mechanisms underlying resistance formation are complex and remain poorly understood. To gain insights into molecular changes during the formation of resistance...
Accurate single cell mutational profiles can reveal genomic cell-to-cell heterogeneity. However, sequencing libraries suitable for genotyping require whole genome amplification, which introduces allelic bias and copy errors. The resulting data violates assumptions of variant callers developed for bulk sequencing. Thus, only dedicated models account...
Motivation: Haplotype phasing approaches have been shown to improve accuracy of the search space of neoantigen prediction by determining if adjacent variants are located on the same chromosomal copy. However, the aneuploid nature of cancer cells as well as the admixture of different clones in tumor bulk sequencing data are challenging the current di...
Data analysis often entails a multitude of heterogeneous steps, from the application of various command line tools to the usage of scripting languages like R or Python for the generation of plots and tables. It is widely recognized that data analyses should ideally be conducted in a reproducible way. Reproducibility enables technical validation and...
The rapid increase in the amount of genomic data provides researchers with an opportunity to integrate diverse datasets and annotations when addressing a wide range of biological questions. However, genomic datasets are deposited on different platforms and are stored in numerous formats from multiple genome builds, which complicates the task of col...
Objective: Current predictive biomarkers for PD-1 (programmed cell death protein 1)/PD-L1 (programmed death-ligand 1)-directed immunotherapy in non-small cell lung cancer (NSCLC) mostly focus on features of tumour cells. However, the tumour microenvironment and immune context are expected to play major roles in governing therapy response. Against th...
Understanding tumor resistance to T cell immunotherapies is critical to improve patient outcomes. Our study revealed a role for transcriptional suppression of the tumor-intrinsic HLA class I (HLA-I) antigen processing and presentation machinery (APM) in therapy resistance. Low HLA-I APM mRNA levels in melanoma metastases prior to immune checkpoint...
Accurate discovery of somatic variants is of central importance in cancer research. However, count statistics on discovered somatic insertions and deletions (indels) indicate that large amounts of discoveries are missed because of the quantification of uncertainties related to gap and alignment ambiguities, twilight zone indels, cancer heterogeneit...
Obtaining accurate mutational profiles from single cell DNA is essential for the analysis of genomic cell-to-cell heterogeneity at the finest level of resolution. However, sequencing libraries suitable for genotyping require whole genome amplification, which introduces allelic bias and copy errors. As a result, single cell DNA sequencing data viola...
The recent boom in microfluidics and combinatorial indexing strategies, combined with low sequencing costs, has empowered single-cell sequencing technology. Thousands, or even millions, of cells analyzed in a single experiment amount to a data revolution in single-cell biology and pose unique data science problems. Here, we outline eleven challenges...
Proteins in living cells rarely act alone, but instead perform their functions together with other proteins in so-called protein complexes. Being able to quantify the similarity between two protein complexes is essential for numerous applications, e.g. for database searches of complexes that are similar to a given input complex. While the similarit...
The recent upswing of microfluidics and combinatorial indexing strategies, further enhanced by very low sequencing costs, has turned single cell sequencing into an empowering technology; analyzing thousands, or even millions, of cells per experimental run is becoming a routine assignment in laboratories worldwide. As a consequence, we are witnessing...
As witnessed by various population-scale cancer genome sequencing projects, accurate discovery of somatic variants has become of central importance in modern cancer research. However, count statistics on somatic insertions and deletions (indels) discovered so far point out that large amounts of discoveries must have been missed. The reason is that...
Motivation: Viruses populate their hosts as a viral quasispecies: a collection of genetically related mutant strains. Viral quasispecies assembly is the reconstruction of strain-specific haplotypes from read data, and predicting their relative abundances within the mix of strains is an important step for various treatment-related reasons. Referenc...
Motivation: Multiplexed error-robust fluorescence in-situ hybridization (MERFISH) is a recent technology to obtain spatially resolved gene or transcript expression profiles in single cells for hundreds to thousands of genes in parallel. So far, no statistical framework to analyze MERFISH data is available. Results: We present a Bayesian model fo...
Many areas of research suffer from poor reproducibility, particularly in computationally intensive domains where results rely on a series of complex methodological decisions that are not well captured by traditional publication approaches. Various guidelines have emerged for achieving reproducibility, but implementation of these practices remains d...
The notion of cancer as a complex evolutionary system has been validated by in-depth molecular analyses of tumor progression over the last years. While a complex interplay of cell-autonomous programs and cell-cell interactions determines proliferation and differentiation during normal development, intrinsic and acquired plasticity of cancer cells a...
Background: RNA sequencing has become a ubiquitous technology used throughout life sciences as an effective method of measuring RNA abundance quantitatively in tissues and cells. The increase in use of RNA-seq technology has led to the continuous development of new tools for every step of analysis from alignment to downstream pathway analysis. How...
Protein interactions are fundamental building blocks of biochemical reaction systems underlying cellular functions. The complexity and functionality of these systems emerge not only from the protein interactions themselves but also...
Motivation: Viruses populate their hosts as a viral quasispecies: a collection of genetically related mutant strains. Viral quasispecies assembly refers to reconstructing the strain-specific haplotypes from read data, and predicting their relative abundances within the mix of strains, an important step for various treatment-related reasons. Referenc...
Being able to quantify the similarity between two protein complexes is essential for numerous applications. Prominent examples are database searches for known complexes with a given query complex, comparison of the output of different protein complex prediction algorithms, or summarizing and clustering protein complexes, e.g., for visualization. Wh...
Cellular functions of biochemical reactions are enabled by protein interactions. In addition to the protein interactions themselves, dependencies between these interactions such as allosteric activation or mutual exclusion contribute to the complexity and functionality of these systems. We introduce a model of constrained protein interaction networ...
We present Bioconda (https://bioconda.github.io), a distribution of bioinformatics software for the lightweight, multi-platform and language-agnostic package manager Conda. Currently, Bioconda offers a collection of over 3000 software packages, which is continuously maintained, updated, and extended by a growing global community of more than 200 co...
Many areas of research suffer from poor reproducibility. This problem is particularly acute in computationally intensive domains where results rely on a series of complex methodological decisions that are not well captured by traditional publication approaches. Various guidelines have emerged for achieving reproducibility, but practical implementat...
Cancer is first and foremost a genetic disorder. Therefore, next-generation sequencing (NGS) based discovery of somatically acquired genetic variants has gained widespread attention. Computational prediction of somatic variants, however, is affected by a variety of confounding factors. In addition to the uncertainties that one commonly encounters a...
Motivation: Despite the growing popularity of CRISPR/Cas9 technology for genome editing and gene knockout, its performance still relies on well-designed single guide RNAs (sgRNA). In this study, we propose a web application for the Design and Optimization (CRISPR-DO) of guide sequences that target both coding and non-coding regions in spCas9...
(Cancer Cell 29, 574–586; April 11, 2016) In the originally published version of this article, coauthor Andrew M. Intlekofer was listed incorrectly as Andrew M. Intlekoffer and coauthor Nicole R. LeBoeuf was listed incorrectly as Nicole LaBoeuf. These errors have now been corrected here and in the article online. The authors apologize for the error...
Motivation: Third generation sequencing methods provide longer reads than second generation methods and have distinct error characteristics. While there exist many read simulators for second generation data, there is a very limited choice for third generation data. Results: We analyzed public data from Pacific Biosciences (PacBio) SMRT sequencing,...
More than 90% of drugs with preclinical activity fail in human trials, largely due to insufficient efficacy. We hypothesized that adequately powered trials of patient-derived xenografts (PDX) in mice could efficiently define therapeutic activity across heterogeneous tumors. To address this hypothesis, we established a large, publicly available repo...
High-throughput CRISPR screens have shown great promise in functional genomics. We present MAGeCK-VISPR, a comprehensive quality control (QC), analysis, and visualization workflow for CRISPR screens. MAGeCK-VISPR defines a set of QC measures to assess the quality of an experiment, and includes a maximum-likelihood algorithm to call essential genes...
To expedite the translation of biologic discoveries into novel therapeutics, there is a pressing need for panels of in vivo models that capture the molecular complexity of human disease. While traditional cell lines and genetically engineered mouse models are useful tools, they are insufficient to assess the broad diversity of human tumors within a...
We present Rust-Bio, the first general purpose bioinformatics library for the innovative Rust programming language. Rust-Bio leverages the unique combination of speed, memory safety and high-level syntax offered by Rust to provide a fast and safe set of bioinformatics algorithms and data structures with a focus on sequence analysis. Availability a...
Dysregulation of the cell cycle and cyclin-dependent kinases (cdks) is a hallmark of cancer cells. Intervention with cdk function is currently evaluated as a therapeutic option in many cancer types including neuroblastoma (NB), a common solid tumor of childhood. Re-analyses of mRNA profiling data from primary NB revealed that high level mRNA expres...
We present the q-group index, a novel data structure for read mapping tailored towards graphics processing units (GPUs) with a small memory footprint and efficient parallel algorithms for querying and building. On top of the q-group index we introduce PEANUT, a highly parallel GPU-based read mapper. PEANUT provides the possibility to output both th...
We present PEANUT (ParallEl AligNment UTility), a highly parallel GPU-based read mapper with several distinguishing features, including a novel q-gram index (called the q-group index) with small memory footprint built on-the-fly over the reads and the possibility to output both the best hits or all hits of a read. Designing the algorithm particular...
Hunger, malnutrition, food insecurity and famines are still persistent in many regions globally. Especially on a local level new concepts of targeting food insecurity need to be implemented. Prior to applying new practical concepts the initial situation has to be understood accurately. Therefore, this methodology paper aims at enhancing monitoring...
Neuroblastoma is the most common extracranial solid tumor of childhood, and accounts for ∼15% of all childhood cancer deaths. The histone demethylase, lysine-specific demethylase 1 (KDM1A, previously known as LSD1), is strongly expressed in neuroblastomas, and overexpression correlates with poor patient prognosis. Inducing differentiation in neurob...
In many cancer types, MYC proteins are known to be master regulators of the RNA-producing machinery. Neuroblastoma is a tumor of early childhood characterized by heterogeneous clinical courses. Amplification of the MYCN oncogene is a marker of poor patient outcome in this disease. Here, we investigated the MYCN-driven transcriptome of 20 primary ne...