
Reinert Knut
Freie Universität Berlin | FUB · Institute of Computer Science
Prof. Dr.
About
225 Publications
72,511 Reads
38,037 Citations
Additional affiliations
April 1999 - November 2002

Celera Corporation
Position
- Senior Researcher
Description
- I worked on assembling the human genome and later on algorithms for label-free quantification of mass spectrometry data.
Education
May 1994 - April 1999
MPI for Computer Science
Field of study
- Computer Science and Computational Biology
October 1989 - April 1994
Publications (225)
We present Masai, a read mapper representing the state-of-the-art in terms of speed and accuracy. Our tool is an order of magnitude faster than RazerS 3 and mrFAST, 2–4 times faster and more accurate than Bowtie 2 and BWA. The novelties of our read mapper are filtration with approximate seeds and a method for multiple backtracking. Approximate seed...
Evaluating metagenomic software is key for optimizing metagenome interpretation and focus of the Initiative for the Critical Assessment of Metagenome Interpretation (CAMI). The CAMI II challenge engaged the community to assess methods on realistic and complex datasets with long- and short-read sequences, created computationally from around 1,700 ne...
Background
The function of non-coding RNA sequences is largely determined by their spatial conformation, namely the secondary structure of the molecule, formed by Watson–Crick interactions between nucleotides. Hence, modern RNA alignment algorithms routinely take structural information into account. In order to discover yet unknown RNA families and...
Background
We benchmarked sequencing technology and assembly strategies for short-read, long-read, and hybrid assemblers in respect to correctness, contiguity, and completeness of assemblies in genomes of Francisella tularensis. Benchmarking allowed in-depth analyses of genomic structures of the Francisella pathogenicity islands and insertion seque...
Today’s scientific data analysis very often requires complex Data Analysis Workflows (DAWs) executed over distributed computational infrastructures, e.g., clusters. Much research effort is devoted to the tuning and performance optimization of specific workflows for specific clusters. However, an arguably even more important problem for accelerating...
Bisulfite sequencing data provide value beyond the straightforward methylation assessment by analyzing single-read patterns. Over the past years, various informative metrics have been established to explore this information. However, limited compatibility with alignment tools, reference genomes or the measurements they provide present a bottleneck...
Evaluating metagenomic software is key for optimizing metagenome interpretation and focus of the community-driven initiative for the Critical Assessment of Metagenome Interpretation (CAMI). In its second challenge, CAMI engaged the community to assess their methods on realistic and complex metagenomic datasets with long and short reads, created fro...
We present Raptor, a system for approximately searching many queries like NGS reads or transcripts in large collections of nucleotide sequences. Raptor uses winnowing minimizers to define a set of representative k-mers, an extension of the Interleaved Bloom Filters (IBF) as a set membership data structure, and probabilistic thresholding for minimiz...
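As a rough illustration of the winnowing minimizers mentioned in the Raptor abstract above, the following sketch selects, for every window of w consecutive k-mers, the smallest k-mer. It is not Raptor's implementation: the function name `winnowing_minimizers` is hypothetical, and plain lexicographic order on the forward strand is assumed here, whereas practical tools typically hash k-mers and consider both strands.

```python
def winnowing_minimizers(seq: str, k: int, w: int) -> list[tuple[int, str]]:
    """Return sorted (position, k-mer) pairs chosen as window minimizers.

    Assumption: k-mers are compared lexicographically on the forward strand;
    ties are broken by taking the leftmost k-mer in the window.
    """
    # All overlapping k-mers of the sequence (empty if seq is shorter than k).
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = set()
    # Slide a window covering w consecutive k-mers.
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        # min() returns the first minimal index, i.e. the leftmost minimum.
        offset = min(range(w), key=lambda j: window[j])
        selected.add((start + offset, window[offset]))
    return sorted(selected)


if __name__ == "__main__":
    # Five 3-mers are reduced to two representatives.
    print(winnowing_minimizers("GATTACA", k=3, w=3))
```

Because neighbouring windows often share their minimum, the selected set is usually much smaller than the full set of k-mers, which is what makes minimizers useful as representatives.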
Engineered nanomaterials are potentially very useful for a variety of applications, but studies are needed to ascertain whether these materials pose a risk to human health. Here, we studied three benchmark nanomaterials (Ag nanoparticles, TiO2 nanoparticles, and multi-walled carbon nanotubes, MWCNTs) procured from the nanomaterial repository at the...
SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2) is a novel virus of the family Coronaviridae. The virus causes the infectious disease COVID-19. The biology of coronaviruses has been studied for many years. However, bioinformatics tools designed explicitly for SARS-CoV-2 have only recently been developed as a rapid reaction to the need...
We present Raptor, a tool for approximately searching many queries in large collections of nucleotide sequences. In comparison with similar tools like Mantis and COBS, Raptor is 12-144 times faster and uses up to 30 times less memory. Raptor uses winnowing minimizers to define a set of representative k-mers, an extension of the Interleaved Bloom F...
Background: Clostridioides difficile infection (CDI) is an increasing zoonotic health threat and has also been documented as a cause of enteritis outbreaks in neonatal pigs. Furthermore, CDI in neonatal piglets causes changes in microbial gut colonization. We hypothesized that an imbalanced microbial colonization in piglets with CDI could be associat...
Despite a well-documented effect of high dietary zinc oxide on the pig intestinal microbiota composition, less is known about changes in microbial functional properties or the effect of organic zinc sources. Forty weaning piglets in four groups were fed diets supplemented with 40 or 110 ppm zinc as zinc oxide, 110 ppm as Zn-Lysinate, or 2500...
RNA-Sequencing is the current method of choice for studying bacterial transcriptomes. To date, many computational pipelines have been developed to predict differentially expressed genes from RNA-Seq data, but no gold-standard has been widely accepted. We present the Snakemake-based tool SCORE which uses a consensus approach founded on a selection o...
Motivation:
The exponential growth of assembled genome sequences greatly benefits metagenomics studies. However, currently available methods struggle to manage the increasing amount of sequences and their frequent updates. Indexing the current RefSeq can take days and hundreds of GB of memory on large servers. Few methods address these issues thus...
The analysis of next-generation sequencing (NGS) data requires complex computational workflows consisting of dozens of autonomously developed yet interdependent processing steps. Whenever large amounts of data need to be processed, these workflows must be executed on a parallel and/or distributed systems to ensure reasonable runtime. Porting a work...
Reconstructing haplotypes from sequencing data is one of the major challenges in genetics. Haplotypes play a crucial role in many analyses, including genome-wide association studies and population genetics. Haplotype reconstruction becomes more difficult for higher numbers of homologous chromosomes, as it is often the case for polyploid plants. Thi...
Zoonotic pathogens that can be transmitted via food to humans have a high potential for large-scale emergencies, comprising severe effects on public health, critical infrastructures, and the economy. In this context, the development of laboratory methods to rapidly detect zoonotic bacteria in the food supply chain, including high-resolution mass sp...
Motivation
DNA metabarcoding is a commonly applied technique used to infer the species composition of environmental samples. These samples can comprise hundreds of organisms that can be closely or very distantly related in the taxonomic tree of life. DNA metabarcoding combines polymerase chain reaction (PCR) and next-generation sequencing (NGS), wh...
Motivation:
Computing the uniqueness of k-mers for each position of a genome while allowing for up to e mismatches is computationally challenging. However, it is crucial for many biological applications such as the design of guide RNA for CRISPR experiments. More formally, the uniqueness or (k, e)-mappability can be described for every position as...
Background: Clostridium difficile infection (CDI) is an increasing zoonotic health threat and has also been documented as a cause of enteritis outbreaks in neonatal pigs. Furthermore, CDI in neonatal piglets causes changes in microbial gut colonization. We hypothesized that an imbalanced microbial colonization in piglets with CDI could be associated...
The jigsaw puzzle of genomics: Evaluation of sequencing and assembly strategies for Francisella tularensis, a re-emerging zoonotic pathogen
Accurate protein inference under the presence of shared peptides is still one of the key problems in bottom-up proteomics. Most protein inference tools employing simple heuristic inference strategies are efficient, but exhibit reduced accuracy. More advanced probabilistic methods often exhibit better inference quality but tend to be too slow for la...
Motivation: Proteogenomics involves the supporting of gene models with experimental proteomics data. Mass spectrometry allows the measurement of peptides and the mass spectra can be assigned a peptide sequence using various algorithms. These tools were not designed for proteogenomics and peptide mapping to reference databases needs to be performed...
Background:
Natural variations in a genome can drastically alter the CRISPR-Cas9 off-target landscape by creating or removing sites. Despite the resulting potential side-effects from such unaccounted for sites, current off-target detection pipelines are not equipped to include variant information. To address this, we developed VARiant-aware detect...
We present a fast and exact algorithm to compute the (k,e)-mappability. Its inverse, the (k,e)-frequency counts the number of occurrences of each k-mer with up to e errors in a sequence. The algorithm we present is a magnitude faster than the algorithm in the widely used GEM suite while not relying on heuristics, and can even compute the mappabilit...
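To make the definition in the abstract above concrete, here is a minimal brute-force sketch of the (k,e)-frequency and its inverse, the (k,e)-mappability. It is not the fast algorithm the abstract describes: it takes quadratic time, assumes that "errors" means mismatches (Hamming distance), and looks only at the forward strand. The function names are hypothetical.

```python
def ke_frequency(text: str, k: int, e: int) -> list[int]:
    """For each k-mer start position, count the k-mers in `text` that match
    it with at most e mismatches (the k-mer itself is included).

    Assumption: errors are mismatches only (Hamming distance).
    """
    n = len(text) - k + 1
    if n <= 0:
        return []
    kmers = [text[i:i + k] for i in range(n)]
    freq = []
    for a in kmers:
        count = 0
        for b in kmers:
            # Hamming distance between two k-mers of equal length.
            mismatches = sum(1 for x, y in zip(a, b) if x != y)
            if mismatches <= e:
                count += 1
        freq.append(count)
    return freq


def ke_mappability(text: str, k: int, e: int) -> list[float]:
    """Mappability as the inverse of the frequency at each position."""
    return [1.0 / f for f in ke_frequency(text, k, e)]


if __name__ == "__main__":
    print(ke_frequency("ACGTACGT", k=4, e=0))
    print(ke_mappability("ACGTACGT", k=4, e=0))
```

A mappability of 1.0 marks a k-mer that is unique under the chosen error bound; lower values mark repeated k-mers.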
Motivation:
Pairwise sequence alignment is undoubtedly a central tool in many bioinformatics analyses. In this paper, we present a generically accelerated module for pairwise sequence alignments applicable for a broad range of applications. In our module, we unified the standard dynamic programming kernel used for pairwise sequence alignments and...
Motivation:
Mapping-based approaches have become limited in their application to very large sets of references since computing an FM-index for very large databases (e.g. >10 GB) has become a bottleneck. This affects many analyses that need such index as an essential step for approximate matching of the NGS reads to reference databases. For instanc...
Motivation
The exponential growth of assembled genome sequences greatly benefits metagenomics studies. However, currently available methods struggle to manage the increasing amount of sequences and their frequent updates. Indexing the current RefSeq can take days and hundreds of GB of memory on large servers. Few methods address these issues thus f...
Finding approximate occurrences of a pattern in a text using a full-text index is a central problem in bioinformatics and has been extensively researched. Bidirectional indices have opened new possibilities in this regard allowing the search to start from anywhere within the pattern and extend in both directions. In particular, use of search scheme...
Motivation
Mapping-based approaches have become limited in their application to very large sets of references since computing an FM-index for very large databases (e.g. > 10 GB) has become a bottleneck. This affects many analyses that need such index as an essential step for approximate matching of the NGS reads to reference databases. For instance...
Finding approximate occurrences of a pattern in a text using a full-text index is a central problem in bioinformatics. Bidirectional indices have opened new possibilities for solving the problem as they allow the search to be started from anywhere within the pattern and extended in both directions. In particular, use of search schemes (partitioning...
Spontaneous Clostridium difficile (CD) outbreaks occur in neonatal piglets but the predisposing factors are largely not known. To study the conditions for CD colonisation and infection development, neonatal piglets (n=48) were moved into isolators, fed bovine milk-based formula and infected with CD 078. Analyses included: clinical scoring; faecal C...
Background:
The use of novel algorithmic techniques is pivotal to many important problems in life science. For example, the sequencing of the human genome (Venter et al., 2001) would not have been possible without advanced assembly algorithms, and the development of practical BWT-based read mappers has been instrumental for NGS analysis. However, ow...
Motivation:
High throughput sequencing machines can process many samples in a single run. For Illumina systems, sequencing reads are barcoded with an additional DNA tag that is contained in the respective sequencing adapters. The recognition of barcode and adapter sequences is hence commonly needed for the analysis of next generation sequencing da...
Background:
In recent years, several mass spectrometry-based omics technologies emerged to investigate qualitative and quantitative changes within thousands of biologically active components such as proteins, lipids and metabolites. The research enabled through these methods potentially contributes to the diagnosis and pathophysiology of human dis...
The unidirectional FM index was introduced by Ferragina and Manzini in 2000 and allows searching for a pattern in the index in one direction. The bidirectional FM index (2FM) was introduced by Lam et al. in 2009. It allows searching for a pattern by extending an infix of the pattern arbitrarily to the left or right. If \(\sigma \) is the size of the al...
Identification and quantification of microorganisms is a significant step in studying the alpha and beta diversities within and between microbial communities respectively. Both identification and quantification of a given microbial community can be carried out using whole genome shotgun sequences with less bias than when using 16S-rDNA sequences. H...
Supplementary data containing details of the datasets used for this study, an accuracy comparison of the different methods per dataset, runtime and memory consumption of each method for individual datasets, and statistical details (STDDEV, MEAN, Variance, Q1, Q2 (median), Q3) of the differences between real and predicted abundances.
Precision-Recall Curves: SLIMM vs. Existing Methods for 8 different datasets
True Positive Rate (TPR)/recall drawn against precision. SLIMM received the highest performance for all of the datasets by detecting most of the microorganisms in each sample while staying precise.
Violin Plots of the difference between real and predicted abundances: SLIMM vs. Existing Methods
The violin plots show how well the different tools predicted the abundances compared to the actual abundances across eight different datasets. From the plots, we can clearly see that SLIMM has the lowest divergence from the actual abundance for most of...
Scatter plots showing predicted vs real abundances by different SLIMM variants
Abundances of 8 different samples predicted by different flavors of SLIMM compared to the true abundance used for simulation.
Precision—Recall Curves of different SLIMM variants for 8 different datasets
True Positive Rate (TPR)/recall drawn against precision. These plots show the accuracy performance of different SLIMM variants, i.e., SLIMM, SLIMM-DG (with digital normalization), SLIMM-NF (without filtration step based on coverage landscape), SLIMM-NF-DG (without filtrati...
Violin Plots of the difference between real and predicted abundances: different SLIMM variants
The violin plots show how well the different variants of SLIMM predicted the abundances compared to the actual abundances across eight different datasets.
Scatter plots showing predicted vs real abundances
Abundances of 8 different samples predicted by different tools compared to the true abundance used for simulation. SLIMM predicted the abundances more accurately than the other tools.
The combination of high-performance liquid chromatography and electrospray ionization ion mobility spectrometry facilitates the two-dimensional separation of complex mixtures in the retention and drift time plane. The ion mobility spectrometer presented here was optimized for flow rates customarily used in High-performance liquid chromatography bet...
Background
The last two human genome assemblies have extended the previous linear golden-path paradigm of the human genome to a graph-like model to better represent regions with a high degree of structural variability. The new model offers opportunities to improve the technical validity of variant calling in whole-genome sequencing (WGS). Methods
W...
We sometimes see manufactured bakery products on the market which are labelled as being gluten free. Why is the content of such gluten proteins important for the bakery industry and for its products? The gluten proteins represent up to 80 % of wheat proteins, and they are conventionally subdivided into gliadins and glutenins. Gli...
Many disciplines, from human genetics and oncology to plant breeding, microbiology and virology, commonly face the challenge of analyzing rapidly increasing numbers of genomes. In case of Homo sapiens, the number of sequenced genomes will approach hundreds of thousands in the next few years. Simply scaling up established bioinformatics pipelines wi...
High-resolution mass spectrometry (MS) has become an important tool in the life sciences, contributing to the diagnosis and understanding of human diseases, elucidating biomolecular structural information and characterizing cellular signaling networks. However, the rapid growth in the volume and complexity of MS data makes transparent, accurate and...
Many disciplines, from human genetics and oncology to plant breeding, microbiology and virology, commonly face the challenge of analyzing rapidly increasing numbers of genomes. In case of Homo sapiens, the number of sequenced genomes will approach hundreds of thousands in the next few years. Simply scaling up established bioinformatics pipelines w...
Identification and quantification of microorganisms is an important step in studying the alpha and beta diversities within and between microbial communities respectively. Both identification and quantification of a given microbial community can be carried out using whole genome shotgun sequences with less bias than using 16S-rRNA sequences. Howeve...
We introduce a new method for conducting an exact search in a uni- and bidirectional FM index in $\mathcal{O}(1)$ time per step while using $\mathcal{O}(\log \sigma \cdot n) + o(\log \sigma \cdot \sigma \cdot n)$ bits of space. This is done by replacing the wavelet tree by a new data structure, the \emph{Enhanced Prefixsum Rank dictionary} (EPR-dic...
Background:
Reproducibility is one of the tenets of the scientific method. Scientific experiments often comprise complex data flows, selection of adequate parameters, and analysis and visualization of intermediate and end results. Breaking down the complexity of such experiments into the joint collaboration of small, repeatable, well defined tasks...
We present CIDANE, a novel framework for genome-based transcript reconstruction and quantification from RNA-seq reads. CIDANE assembles transcripts efficiently with significantly higher sensitivity and precision than existing tools. Its algorithmic core not only reconstructs transcripts ab initio, but also allows the use of the growing annotation o...