Reinert Knut

Freie Universität Berlin | FUB · Institute of Computer Science

Prof. Dr.

About

225
Publications
72,511
Reads
38,037
Citations
Additional affiliations
April 1999 - November 2002
Celera Corporation
Position
  • Senior Researcher
Description
  • I worked on assembling the human genome and later on algorithms for label-free quantification of mass spectrometry data.
Education
May 1994 - April 1999
MPI for Computer Science
Field of study
  • Computer Science and Computational Biology
October 1989 - April 1994
Universität des Saarlandes
Field of study
  • Computer Science

Publications

Article
Full-text available
We present Masai, a read mapper representing the state-of-the-art in terms of speed and accuracy. Our tool is an order of magnitude faster than RazerS 3 and mrFAST, 2–4 times faster and more accurate than Bowtie 2 and BWA. The novelties of our read mapper are filtration with approximate seeds and a method for multiple backtracking. Approximate seed...
Article
Full-text available
Evaluating metagenomic software is key for optimizing metagenome interpretation and focus of the Initiative for the Critical Assessment of Metagenome Interpretation (CAMI). The CAMI II challenge engaged the community to assess methods on realistic and complex datasets with long- and short-read sequences, created computationally from around 1,700 ne...
Article
Full-text available
Background: The function of non-coding RNA sequences is largely determined by their spatial conformation, namely the secondary structure of the molecule, formed by Watson–Crick interactions between nucleotides. Hence, modern RNA alignment algorithms routinely take structural information into account. In order to discover yet unknown RNA families and...
Article
Full-text available
Background: We benchmarked sequencing technology and assembly strategies for short-read, long-read, and hybrid assemblers with respect to correctness, contiguity, and completeness of assemblies in genomes of Francisella tularensis. Benchmarking allowed in-depth analyses of genomic structures of the Francisella pathogenicity islands and insertion seque...
Article
Today’s scientific data analysis very often requires complex Data Analysis Workflows (DAWs) executed over distributed computational infrastructures, e.g., clusters. Much research effort is devoted to the tuning and performance optimization of specific workflows for specific clusters. However, an arguably even more important problem for accelerating...
Article
Bisulfite sequencing data provide value beyond the straightforward methylation assessment by analyzing single-read patterns. Over the past years, various informative metrics have been established to explore this information. However, limited compatibility with alignment tools, reference genomes or the measurements they provide present a bottleneck...
Preprint
Full-text available
Evaluating metagenomic software is key for optimizing metagenome interpretation and focus of the community-driven initiative for the Critical Assessment of Metagenome Interpretation (CAMI). In its second challenge, CAMI engaged the community to assess their methods on realistic and complex metagenomic datasets with long and short reads, created fro...
Article
Full-text available
We present Raptor, a system for approximately searching many queries like NGS reads or transcripts in large collections of nucleotide sequences. Raptor uses winnowing minimizers to define a set of representative k-mers, an extension of the Interleaved Bloom Filters (IBF) as a set membership data structure, and probabilistic thresholding for minimiz...
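The winnowing-minimizer idea named in this abstract can be illustrated with a minimal Python sketch. It is not Raptor's implementation: the function name `minimizers`, the window parameter `w`, and the use of plain lexicographic order to rank k-mers are assumptions made only for illustration.

```python
def minimizers(seq, k, w):
    """Return the sorted list of (position, k-mer) minimizers of seq.

    Assumption: a window covers w consecutive k-mers and the minimizer is
    the lexicographically smallest k-mer in it (leftmost one on ties).
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        # min() returns the first index with the smallest k-mer.
        j = min(range(w), key=lambda i: window[i])
        selected.add((start + j, window[j]))
    return sorted(selected)


if __name__ == "__main__":
    print(minimizers("ACGTAC", k=2, w=3))
```

Adjacent windows often share the same minimizer, so the selected set is usually much smaller than the set of all k-mers, which is what makes minimizers useful as representatives.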
Article
Full-text available
Engineered nanomaterials are potentially very useful for a variety of applications, but studies are needed to ascertain whether these materials pose a risk to human health. Here, we studied three benchmark nanomaterials (Ag nanoparticles, TiO2 nanoparticles, and multi-walled carbon nanotubes, MWCNTs) procured from the nanomaterial repository at the...
Article
Full-text available
SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2) is a novel virus of the family Coronaviridae. The virus causes the infectious disease COVID-19. The biology of coronaviruses has been studied for many years. However, bioinformatics tools designed explicitly for SARS-CoV-2 have only recently been developed as a rapid reaction to the need...
Preprint
Full-text available
We present Raptor, a tool for approximately searching many queries in large collections of nucleotide sequences. In comparison with similar tools like Mantis and COBS, Raptor is 12-144 times faster and uses up to 30 times less memory. Raptor uses winnowing minimizers to define a set of representative k-mers, an extension of the Interleaved Bloom F...
Preprint
Full-text available
Background: Clostridioides difficile infection (CDI) is an increasing zoonotic health threat and has also been documented as a cause of enteritis outbreaks in neonatal pigs. Furthermore, CDI in neonatal piglets causes changes in microbial gut colonization. We hypothesized that an imbalanced microbial colonization in piglets with CDI could be associat...
Article
Full-text available
Despite a well-documented effect of high dietary zinc oxide on the pig intestinal microbiota composition, less is known about changes in microbial functional properties or the effect of organic zinc sources. Forty weaning piglets in four groups were fed diets supplemented with 40 or 110 ppm zinc as zinc oxide, 110 ppm as Zn-Lysinate, or 2500...
Article
RNA-Sequencing is the current method of choice for studying bacterial transcriptomes. To date, many computational pipelines have been developed to predict differentially expressed genes from RNA-Seq data, but no gold-standard has been widely accepted. We present the Snakemake-based tool SCORE which uses a consensus approach founded on a selection o...
Article
Full-text available
Motivation: The exponential growth of assembled genome sequences greatly benefits metagenomics studies. However, currently available methods struggle to manage the increasing amount of sequences and their frequent updates. Indexing the current RefSeq can take days and hundreds of GB of memory on large servers. Few methods address these issues thus...
Preprint
Full-text available
The analysis of next-generation sequencing (NGS) data requires complex computational workflows consisting of dozens of autonomously developed yet interdependent processing steps. Whenever large amounts of data need to be processed, these workflows must be executed on parallel and/or distributed systems to ensure reasonable runtime. Porting a work...
Article
Full-text available
Reconstructing haplotypes from sequencing data is one of the major challenges in genetics. Haplotypes play a crucial role in many analyses, including genome-wide association studies and population genetics. Haplotype reconstruction becomes more difficult for higher numbers of homologous chromosomes, as it is often the case for polyploid plants. Thi...
Preprint
Full-text available
SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2) is a novel virus of the family Coronaviridae. The virus causes the infectious disease COVID-19. The biology of coronaviruses has been studied for many years. However, bioinformatics tools designed explicitly for SARS-CoV-2 have only recently been developed as a rapid reaction to the need...
Article
Full-text available
Zoonotic pathogens that can be transmitted via food to humans have a high potential for large-scale emergencies, comprising severe effects on public health, critical infrastructures, and the economy. In this context, the development of laboratory methods to rapidly detect zoonotic bacteria in the food supply chain, including high-resolution mass sp...
Preprint
Full-text available
Motivation: DNA metabarcoding is a commonly applied technique used to infer the species composition of environmental samples. These samples can comprise hundreds of organisms that can be closely or very distantly related in the taxonomic tree of life. DNA metabarcoding combines polymerase chain reaction (PCR) and next-generation sequencing (NGS), wh...
Article
Full-text available
Motivation: Computing the uniqueness of k-mers for each position of a genome while allowing for up to e mismatches is computationally challenging. However, it is crucial for many biological applications such as the design of guide RNA for CRISPR experiments. More formally, the uniqueness or (k, e)-mappability can be described for every position as...
Preprint
Full-text available
Background: Clostridium difficile infection (CDI) is an increasing zoonotic health threat and has also been documented as a cause of enteritis outbreaks in neonatal pigs. Furthermore, CDI in neonatal piglets causes changes in microbial gut colonization. We hypothesized that an imbalanced microbial colonization in piglets with CDI could be associated...
Presentation
Full-text available
The jigsaw puzzle of genomics: Evaluation of sequencing and assembly strategies for Francisella tularensis, a re-emerging zoonotic pathogen
Article
Accurate protein inference under the presence of shared peptides is still one of the key problems in bottom-up proteomics. Most protein inference tools employing simple heuristic inference strategies are efficient, but exhibit reduced accuracy. More advanced probabilistic methods often exhibit better inference quality but tend to be too slow for la...
Preprint
Full-text available
Motivation: Proteogenomics involves the supporting of gene models with experimental proteomics data. Mass spectrometry allows the measurement of peptides and the mass spectra can be assigned a peptide sequence using various algorithms. These tools were not designed for proteogenomics and peptide mapping to reference databases needs to be performed...
Preprint
Full-text available
Accurate protein inference under the presence of shared peptides is still one of the key problems in bottom-up proteomics. Most protein inference tools employing simple heuristic inference strategies are efficient, but exhibit reduced accuracy. More advanced probabilistic methods often exhibit better inference quality but tend to be too slow for la...
Article
Full-text available
Background: Natural variations in a genome can drastically alter the CRISPR-Cas9 off-target landscape by creating or removing sites. Despite the resulting potential side-effects from such unaccounted for sites, current off-target detection pipelines are not equipped to include variant information. To address this, we developed VARiant-aware detect...
Preprint
We present a fast and exact algorithm to compute the (k,e)-mappability. Its inverse, the (k,e)-frequency counts the number of occurrences of each k-mer with up to e errors in a sequence. The algorithm we present is a magnitude faster than the algorithm in the widely used GEM suite while not relying on heuristics, and can even compute the mappabilit...
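The relationship between the (k,e)-frequency and the (k,e)-mappability described in this abstract can be made concrete with a brute-force Python sketch. It is only a quadratic-time illustration of the definitions, not the fast algorithm the preprint presents. The function names are hypothetical, "errors" are assumed to mean mismatches (Hamming distance), and the mappability is taken as the reciprocal of the frequency.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def ke_frequency(seq, k, e):
    """For each k-mer start position, count the k-mers of seq (including
    itself) that match it with at most e mismatches. Brute force, O(n^2 * k)."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return [sum(1 for other in kmers if hamming(kmer, other) <= e)
            for kmer in kmers]


def ke_mappability(seq, k, e):
    """Assumed definition: reciprocal of the (k,e)-frequency per position."""
    return [1.0 / f for f in ke_frequency(seq, k, e)]


if __name__ == "__main__":
    print(ke_frequency("AAAAC", k=2, e=0))
    print(ke_frequency("AAAAC", k=2, e=1))
    print(ke_mappability("AAAAC", k=2, e=0))
```

A mappability of 1.0 marks a k-mer that is unique in the sequence under the chosen error threshold; lower values indicate repeats.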
Article
Motivation: Pairwise sequence alignment is undoubtedly a central tool in many bioinformatics analyses. In this paper, we present a generically accelerated module for pairwise sequence alignments applicable for a broad range of applications. In our module, we unified the standard dynamic programming kernel used for pairwise sequence alignments and...
Article
Full-text available
Motivation: Mapping-based approaches have become limited in their application to very large sets of references since computing an FM-index for very large databases (e.g. >10 GB) has become a bottleneck. This affects many analyses that need such index as an essential step for approximate matching of the NGS reads to reference databases. For instanc...
Preprint
Full-text available
Motivation: The exponential growth of assembled genome sequences greatly benefits metagenomics studies. However, currently available methods struggle to manage the increasing amount of sequences and their frequent updates. Indexing the current RefSeq can take days and hundreds of GB of memory on large servers. Few methods address these issues thus f...
Preprint
Full-text available
Finding approximate occurrences of a pattern in a text using a full-text index is a central problem in bioinformatics and has been extensively researched. Bidirectional indices have opened new possibilities in this regard allowing the search to start from anywhere within the pattern and extend in both directions. In particular, use of search scheme...
Preprint
Full-text available
Motivation: Mapping-based approaches have become limited in their application to very large sets of references since computing an FM-index for very large databases (e.g. > 10 GB) has become a bottleneck. This affects many analyses that need such index as an essential step for approximate matching of the NGS reads to reference databases. For instance...
Article
Full-text available
Finding approximate occurrences of a pattern in a text using a full-text index is a central problem in bioinformatics. Bidirectional indices have opened new possibilities for solving the problem as they allow the search to be started from anywhere within the pattern and extended in both directions. In particular, use of search schemes (partitioning...
Article
Spontaneous Clostridium difficile (CD) outbreaks occur in neonatal piglets but the predisposing factors are largely not known. To study the conditions for CD colonisation and infection development, neonatal piglets (n=48) were moved into isolators, fed bovine milk-based formula and infected with CD 078. Analyses included: clinical scoring; faecal C...
Article
Background: The use of novel algorithmic techniques is pivotal to many important problems in life science. For example, the sequencing of the human genome (Venter et al., 2001) would not have been possible without advanced assembly algorithms, and the development of practical BWT-based read mappers has been instrumental for NGS analysis. However, ow...
Article
Full-text available
Motivation: High throughput sequencing machines can process many samples in a single run. For Illumina systems, sequencing reads are barcoded with an additional DNA tag that is contained in the respective sequencing adapters. The recognition of barcode and adapter sequences is hence commonly needed for the analysis of next generation sequencing da...
Article
Background: In recent years, several mass spectrometry-based omics technologies emerged to investigate qualitative and quantitative changes within thousands of biologically active components such as proteins, lipids and metabolites. The research enabled through these methods potentially contributes to the diagnosis and pathophysiology of human dis...
Conference Paper
The unidirectional FM index was introduced by Ferragina and Manzini in 2000 and allows searching for a pattern in the index in one direction. The bidirectional FM index (2FM) was introduced by Lam et al. in 2009. It allows searching for a pattern by extending an infix of the pattern arbitrarily to the left or right. If \(\sigma \) is the size of the al...
Article
Full-text available
Identification and quantification of microorganisms is a significant step in studying the alpha and beta diversities within and between microbial communities respectively. Both identification and quantification of a given microbial community can be carried out using whole genome shotgun sequences with less bias than when using 16S-rDNA sequences. H...
Data
Supplementary data containing details of the datasets used for this study, an accuracy comparison of the different methods per dataset, the runtime and memory consumption of each method for individual datasets, and statistical details (STDDEV, MEAN, Variance, Q1, Q2 (median), Q3) of the differences between real and predicted abundances.
Data
Precision-Recall Curves: SLIMM vs. Existing Methods for 8 different datasets. True Positive Rate (TPR)/recall drawn against precision. SLIMM achieved the highest performance for all of the datasets by detecting most of the microorganisms in each sample while staying precise.
Data
Violin Plots of the difference between real and predicted abundances: SLIMM vs. Existing Methods The violin plots show how well the different tools predicted the abundances compared to the actual abundances across eight different datasets. From the plots, we can clearly see that SLIMM has the lowest divergence from the actual abundance for most of...
Data
Scatter plots showing predicted vs real abundances by different SLIMM variants Abundances of 8 different samples predicted by different flavors of SLIMM compared to the true abundance used for simulation.
Data
Precision-Recall Curves of different SLIMM variants for 8 different datasets. True Positive Rate (TPR)/recall drawn against precision. These plots show the accuracy performance of different SLIMM variants, i.e., SLIMM, SLIMM-DG (with digital normalization), SLIMM-NF (without filtration step based on coverage landscape), SLIMM-NF-DG (without filtrati...
Data
Violin Plots of the difference between real and predicted abundances: different SLIMM variants The violin plots show how well the different variants of SLIMM predicted the abundances compared to the actual abundances across eight different datasets.
Data
Scatter plots showing predicted vs real abundances Abundances of 8 different samples predicted by different tools compared to the true abundance used for simulation. SLIMM predicted the abundances more accurately than the other tools.
Article
The combination of high-performance liquid chromatography and electrospray ionization ion mobility spectrometry facilitates the two-dimensional separation of complex mixtures in the retention and drift time plane. The ion mobility spectrometer presented here was optimized for flow rates customarily used in High-performance liquid chromatography bet...
Article
Full-text available
Background: The last two human genome assemblies have extended the previous linear golden-path paradigm of the human genome to a graph-like model to better represent regions with a high degree of structural variability. The new model offers opportunities to improve the technical validity of variant calling in whole-genome sequencing (WGS). Methods: W...
Article
We sometimes see manufactured bakery products on the market which are labelled as being gluten free. Why is the content of such gluten proteins of importance for the fabrication of bakery industry and for the products? The gluten proteins represent up to 80 % of wheat proteins, and they are conventionally subdivided into gliadins and glutenins. Gli...
Article
Full-text available
Many disciplines, from human genetics and oncology to plant breeding, microbiology and virology, commonly face the challenge of analyzing rapidly increasing numbers of genomes. In case of Homo sapiens, the number of sequenced genomes will approach hundreds of thousands in the next few years. Simply scaling up established bioinformatics pipelines wi...
Article
Full-text available
High-resolution mass spectrometry (MS) has become an important tool in the life sciences, contributing to the diagnosis and understanding of human diseases, elucidating biomolecular structural information and characterizing cellular signaling networks. However, the rapid growth in the volume and complexity of MS data makes transparent, accurate and...
Preprint
Many disciplines, from human genetics and oncology to plant breeding, microbiology and virology, commonly face the challenge of analyzing rapidly increasing numbers of genomes. In case of Homo sapiens, the number of sequenced genomes will approach hundreds of thousands in the next few years. Simply scaling up established bioinformatics pipelines w...
Preprint
Full-text available
Identification and quantification of microorganisms is an important step in studying the alpha and beta diversities within and between microbial communities respectively. Both identification and quantification of a given microbial community can be carried out using whole genome shotgun sequences with less bias than using 16S-rRNA sequences. Howeve...
Preprint
Full-text available
Identification and quantification of microorganisms is an important step in studying the alpha and beta diversities within and between microbial communities respectively. Both identification and quantification of a given microbial community can be carried out using whole genome shotgun sequences with less bias than using 16S-rRNA sequences. Howeve...
Article
Full-text available
We introduce a new method for conducting an exact search in a uni- and bidirectional FM index in $\mathcal{O}(1)$ time per step while using $\mathcal{O}(\log \sigma \cdot n) + o(\log \sigma \cdot \sigma \cdot n)$ bits of space. This is done by replacing the wavelet tree by a new data structure, the Enhanced Prefixsum Rank dictionary (EPR-dic...
Article
Full-text available
Background: Reproducibility is one of the tenets of the scientific method. Scientific experiments often comprise complex data flows, selection of adequate parameters, and analysis and visualization of intermediate and end results. Breaking down the complexity of such experiments into the joint collaboration of small, repeatable, well defined tasks...
Article
Full-text available
We present CIDANE, a novel framework for genome-based transcript reconstruction and quantification from RNA-seq reads. CIDANE assembles transcripts efficiently with significantly higher sensitivity and precision than existing tools. Its algorithmic core not only reconstructs transcripts ab initio, but also allows the use of the growing annotation o...