David Koslicki

David Koslicki
Pennsylvania State University | Penn State · Department of Computer Science and Engineering

PhD

About

86
Publications
15,145
Reads
How we measure 'reads'
A 'read' is counted each time someone views a publication summary (such as the title, abstract, and list of authors), clicks on a figure, or views or downloads the full-text. Learn more
2,150
Citations
Additional affiliations
August 2013 - present
Oregon State University
Position
  • Professor (Assistant)
January 2012 - July 2012
Drexel University
Position
  • PostDoc Position
August 2012 - August 2013
The Ohio State University
Position
  • PostDoc Position

Publications

Publications (86)
Article
Motivation Functional profiling of metagenomic samples is essential to decipher the functional capabilities of microbial communities. Traditional and more widely used functional profilers in the context of metagenomics rely on aligning reads against a known reference database. However, aligning sequencing reads against a large and fast-growing data...
Preprint
Full-text available
Distance-guided tree construction with unknown tree topology and branch lengths has been a long studied problem. In contrast, distance-guided branch lengths assignment with fixed tree topology has not yet been systematically investigated, despite having significant applications. In this paper, we provide a formal mathematical formulation of this pr...
Preprint
Full-text available
The increasing number and volume of genomic and metagenomic data necessitates scalable and robust computational models for precise analysis. Sketching techniques utilizing k-mers from a biological sample have proven to be useful for large-scale analyses. In recent years, FracMinHash has emerged as a popular sketching technique and has been used in...
Preprint
Full-text available
Motivation: The sheer volume and variety of genomic content within microbial communities makes metagenomics a field rich in biomedical knowledge. To traverse these complex communities and their vast unknowns, metagenomic studies often depend on distinct reference databases, such as the Genome Taxonomy Database (GTDB), the Kyoto Encyclopedia of Gene...
Article
Full-text available
Motivation In metagenomics, the study of environmentally associated microbial communities from their sampled DNA, one of the most fundamental computational tasks is that of determining which genomes from a reference database are present or absent in a given sample metagenome. Existing tools generally return point estimates, with no associated confi...
Conference Paper
Idiopathic multicentric Castleman disease (iMCD) is a rare, immunological illness with an unknown etiology. It is characterized by multiple enlarged lymph nodes with unique histopathological features and systemic inflammation resulting from cytokine release syndrome, which can rapidly develop into multiple organ failure and death. Inhibition of the...
Preprint
Full-text available
Sketching methods provide scalable solutions for analyzing rapidly growing genomic data. A recent innovation in sketching methods, syncmers, has proven effective and has been employed for read alignment. Syncmers share fundamental features with the FracMinHash technique, a recent modification of the popular MinHash algorithm for set similarity esti...
Preprint
Full-text available
Motivation Functional profiling of metagenomic samples is essential to decipher the functional capabilities of these microbial communities. Traditional and more widely used functional profilers in the context of metagenomics rely on aligning reads against a known reference database. However, aligning sequencing reads against a large and fast-growin...
Article
Full-text available
Knowledge graphs have become a common approach for knowledge representation. Yet, the application of graph methodology is elusive due to the sheer number and complexity of knowledge sources. In addition, semantic incompatibilities hinder efforts to harmonize and integrate across these diverse sources. As part of The Biomedical Translator Consortium...
Article
Full-text available
Motivation: Metagenomic samples have high spatiotemporal variability. Hence, it is useful to summarize and characterize the microbial makeup of a given environment in a way that is biologically reasonable and interpretable. The UniFrac metric has been a robust and widely used metric for measuring the variability between metagenomic samples. We pro...
Article
Full-text available
Sketching methods offer computational biologists scalable techniques to analyze data sets that continue to grow in size. MinHash is one such technique to estimate set similarity that has enjoyed recent broad application. However, traditional MinHash has previously been shown to perform poorly when applied to sets of very dissimilar sizes. FracMinHa...
Preprint
Full-text available
In metagenomics, the study of environmentally associated microbial communities from their sampled DNA, one of the most fundamental computational tasks is that of determining which genomes from a reference database are present or absent in a given sample metagenome. While tools exist to answer this question, all existing approaches to date return po...
Article
Full-text available
Motivation: With the rapidly growing volume of knowledge and data in biomedical databases, improved methods for knowledge-graph-based computational reasoning are needed in order to answer translational questions. Previous efforts to solve such challenging computational reasoning problems have contributed tools and approaches, but progress has been...
Preprint
Full-text available
Metagenomic samples have high spatiotemporal variability. Hence, it is useful to summarize and characterize the microbial makeup of a given environment in a way that is biologically reasonable and interpretable. The UniFrac metric has been a robust and widely-used metric for measuring the variability between metagenomic samples. We propose that the...
Article
Full-text available
Background: Metagenomic taxonomic profiling aims to predict the identity and relative abundance of taxa in a given whole-genome sequencing metagenomic sample. A recent surge in computational methods that aim to accurately estimate taxonomic profiles, called taxonomic profilers, has motivated community-driven efforts to create standardized benchmar...
Article
Full-text available
Background: Computational drug repurposing is a cost- and time-efficient approach that aims to identify new therapeutic targets or diseases (indications) of existing drugs/compounds. It is especially critical for emerging and/or orphan diseases due to its cheaper investment and shorter research cycle compared with traditional wet-lab drug discover...
Preprint
Full-text available
Background Computational drug repurposing is a cost- and time-efficient approach that aims to identify new therapeutic targets or diseases (indications) of existing drugs/compounds. It is especially critical for emerging and/or orphan diseases due to its cheaper investment and shorter research cycle compared with traditional wet-lab drug discovery...
Article
Full-text available
Background Biomedical translational science is increasingly using computational reasoning on repositories of structured knowledge (such as UMLS, SemMedDB, ChEMBL, Reactome, DrugBank, and SMPDB in order to facilitate discovery of new therapeutic targets and modalities. The NCATS Biomedical Data Translator project is working to federate autonomous re...
Preprint
Full-text available
Motivation: With the rapidly growing volume of knowledge and data in biomedical databases, improved methods for knowledge-graph-based computational reasoning are needed in order to answer translational questions. Previous efforts to solve such challenging computational reasoning problems have contributed tools and approaches, but progress has been...
Article
Full-text available
Motivation: K-mer-based methods are used ubiquitously in the field of computational biology. However, determining the optimal value of k for a specific application often remains heuristic. Simply reconstructing a new k-mer set with another k-mer size is computationally expensive, especially in metagenomic analysis where datasets are large. Here, w...
Article
Full-text available
Motivation: Sketching is now widely used in bioinformatics to reduce data size and increase data processing speed. Sketching approaches entice with improved scalability but also carry the danger of decreased accuracy and added bias. In this article, we investigate the minimizer sketch and its use to estimate the Jaccard similarity between two sequ...
Preprint
Full-text available
Metagenomic taxonomic profiling aims to predict the identity and relative abundance of taxa in a given whole genome sequencing metagenomic sample. A recent surge in computational methods that aim to accurately estimate taxonomic profiles, called taxonomic profilers, have motivated community driven efforts to create standardized benchmarking dataset...
Article
Full-text available
Evaluating metagenomic software is key for optimizing metagenome interpretation and focus of the Initiative for the Critical Assessment of Metagenome Interpretation (CAMI). The CAMI II challenge engaged the community to assess methods on realistic and complex datasets with long- and short-read sequences, created computationally from around 1,700 ne...
Article
Full-text available
While the use of networks to understand how complex systems respond to perturbations is pervasive across scientific disciplines, the uncertainty associated with estimates of pairwise interaction strengths (edge weights) remains rarely considered. Mischaracterizations of interaction strength can lead to qualitatively incorrect predictions regarding...
Article
k-mer-based methods are widely used in bioinformatics, but there are many gaps in our understanding of their statistical properties. Here, we consider the simple model where a sequence S (e.g., a genome or a read) undergoes a simple mutation process through which each nucleotide is mutated independently with some probability r, under the assumption...
Preprint
Full-text available
Motivation Sketching is now widely used in bioinformatics to reduce data size and increase data processing speed. Sketching approaches entice with improved scalability but also carry the danger of decreased accuracy and added bias. In this paper, we investigate the minimizer sketch and its use to estimate the Jaccard similarity between two sequence...
Preprint
Full-text available
The identification of reference genomes and taxonomic labels from metagenome data underlies many microbiome studies. Here we describe two algorithms for compositional analysis of metagenome sequencing data. We first investigate the FracMinHash sketching technique, a derivative of modulo hash that supports Jaccard containment estimation between sets...
Preprint
Full-text available
Sketching methods offer computational biologists scalable techniques to analyze data sets that continue to grow in size. MinHash is one such technique that has enjoyed recent broad application. However, traditional MinHash has previously been shown to perform poorly when applied to sets of very dissimilar sizes. FracMinHash was recently introduced...
Preprint
Full-text available
K-mer based methods are used ubiquitously in the field of computational biology. However, determining the optimal value of k for a specific application often remains heuristic. Simply reconstructing a new k-mer set with another k-mer size is computationally expensive, especially in metagenomic analysis where data sets are large. Here, we introduce...
Preprint
Full-text available
Background Biomedical translational science is increasingly leveraging computational reasoning on large repositories of structured knowledge (such as the Unified Medical Language System (UMLS), the Semantic Medline Database (SemMedDB), ChEMBL, DrugBank, and the Small Molecule Pathway Database (SMPDB)) and data in order to facilitate discovery of ne...
Article
Full-text available
Aligning sequencing reads onto a reference is an essential step of the majority of genomic analysis pipelines. Computational algorithms for read alignment have evolved in accordance with technological advances, leading to today’s diverse array of alignment methods. We provide a systematic survey of algorithmic foundations and methodologies across 1...
Preprint
Full-text available
Evaluating metagenomic software is key for optimizing metagenome interpretation and focus of the community-driven initiative for the Critical Assessment of Metagenome Interpretation (CAMI). In its second challenge, CAMI engaged the community to assess their methods on realistic and complex metagenomic datasets with long and short reads, created fro...
Article
Full-text available
‘Knowledge graphs’ (KGs) have become a common approach for representing biomedical knowledge. In a KG, multiple biomedical datasets can be linked together as a graph representation, with nodes representing entities such as ‘chemical substance’ or ‘genes’ and edges representing predicates such as ‘causes’ or ‘treats’. Reasoning and inference algorit...
Article
Computational methods are key in microbiome research, and obtaining a quantitative and unbiased performance estimate is important for method developers and applied researchers. For meaningful comparisons between methods, to identify best practices and common use cases, and to reduce overhead in benchmarking, it is necessary to have standardized dat...
Preprint
Full-text available
K-mer-based methods are widely used in bioinformatics, but there are many gaps in our understanding of their statistical properties. Here, we consider the simple model where a sequence S (e.g. a genome or a read) undergoes a simple mutation process whereby each nucleotide is mutated independently with some probability r, under the assumption that t...
Article
Full-text available
Abstract Metagenomics studies leverage genomic reference databases to generate discoveries in basic science and translational research. However, current microbial studies use disparate reference databases that lack consistent standards of specimen inclusion, data preparation, taxon labelling and accessibility, hindering their quality and comprehens...
Article
Full-text available
Metagenomic profiling, predicting the presence and relative abundances of microbes in a sample, is a critical first step in microbiome analysis. Alignment-based approaches are often considered accurate yet computationally infeasible. Here, we present a novel method, Metalign, that performs efficient and accurate alignment-based metagenomic profilin...
Preprint
Full-text available
Computational methods are key in microbiome research, and obtaining a quantitative and unbiased performance estimate is important for method developers and applied researchers. For meaningful comparisons between methods, to identify best practices, common use cases, and to reduce overhead in benchmarking, it is necessary to have standardized data s...
Article
Full-text available
An amendment to this paper has been published and can be accessed via the original article.
Preprint
Full-text available
Massively parallel sequencing techniques have revolutionized biological and medical sciences by providing unprecedented insight into the genomes of humans, animals, and microbes. Aligned reads are essential for answering important biological questions, such as detecting mutations driving various human diseases and complex traits as well as identify...
Preprint
Full-text available
When analyzing communities of microorganisms from their sequenced DNA, an important task is taxonomic profiling: enumerating the presence and relative abundance of all organisms, or merely of all taxa, contained in the sample. This task can be tackled via compressive-sensing-based approaches, which favor communities featuring the fewest organisms a...
Preprint
Full-text available
Whole-genome shotgun sequencing enables the analysis of microbial communities in unprecedented detail, with major implications in medicine and ecology. Predicting the presence and relative abundances of microbes in a sample, known as "metagenomic profiling", is a critical first step in microbiome analysis. Existing profiling methods have been shown...
Preprint
Full-text available
During the past decade, rapid advancements in sequencing technologies have enabled the study of human-associated microbiomes at an unprecedented scale. Metagenomics, specifically, is emerging as a clinical tool to identify the agents of infection, track the spread of diseases, and surveil potential pathogens. Yet, despite advances in high-throughpu...
Preprint
Full-text available
A bstract Computational drug repurposing, also called drug repositioning, is a low cost, promising tool for finding new uses for existing drugs. With the continued growth of repositories of biomedical data and knowledge, increasingly varied kinds of information are available to train machine learning approaches to drug repurposing. However, existin...
Conference Paper
Full-text available
The past decade has witnessed a revolution in microbiology through the advent microbiomics: investigating entire microbial ecosystems in and around us as opposed to studying single microbes in isolation. Recent technological advances in high throughput sequencing technology has enabled these studies, significantly improving our capacity to explore...
Article
MinHash is a probabilistic method for estimating the similarity of two sets in terms of their Jaccard index, defined as the ratio of the size of their intersection to their union. We demonstrate that this method performs best when the sets under consideration are of similar size and the performance degrades considerably when the sets are of very di...
Article
Full-text available
Background: High throughput sequencing has spurred the development of metagenomics, which involves the direct analysis of microbial communities in various environments such as soil, ocean water, and the human body. Many existing methods based on marker genes or k-mers have limited sensitivity or are too computationally demanding for many users. Ad...
Article
Full-text available
The explosive growth in taxonomic metagenome profiling methods over the past years has created a need for systematic comparisons using relevant performance criteria. The Open-community Profiling Assessment tooL (OPAL) implements commonly used performance metrics, including those of the first challenge of the initiative for the Critical Assessment o...
Preprint
Full-text available
Reference genomes are essential for metagenomics studies, which require comparing short metagenomic reads with available reference genomes to identify organisms within a sample. We analyzed the current state of fungal reference databases to assess their usability as reference databases for metagenomic studies. The overlap of genera and species in t...
Article
Full-text available
Both the weighted and unweighted Unifrac distances have been very successfully employed to assess if two communities differ, but do not give any information about how two communities differ. We take advantage of recent observations that the Unifrac metric is equivalent to the so-called earth mover's distance (also known as the Kantorovich-Rubinstei...
Preprint
Full-text available
Taxonomic metagenome profilers predict the presence and relative abundance of microorganisms from shotgun sequence samples of DNA isolated directly from a microbial community. Over the past years, there has been an explosive growth of software and algorithms for this task, resulting in a need for more systematic comparisons of these methods based o...
Article
Full-text available
The role of the human microbiome in health and disease is increasingly appreciated. We studied the composition of microbial communities present in blood across 192 individuals, including healthy controls and patients with three disorders affecting the brain: schizophrenia, amyotrophic lateral sclerosis, and bipolar disorder. By using high-quality u...
Preprint
Full-text available
The role of the human microbiome in health and disease is increasingly appreciated. We studied the composition of microbial communities present in blood across 192 individuals, including healthy controls and patients with three disorders affecting the brain: schizophrenia, amyotrophic lateral sclerosis and bipolar disorder. By using high quality un...
Article
Full-text available
We consider the goal of predicting how complex networks respond to chronic (press) perturbations when characterizations of their network topology and interaction strengths are associated with uncertainty. Our primary result is the derivation of exact formulas for the expected number and probability of qualitatively incorrect predictions about a sys...
Preprint
Full-text available
Background High throughput sequencing has spurred the development of metagenomics, which involves the direct analysis of microbial communities in various environments such as soil, ocean water, and the human body. Many existing methods based on marker genes or k-mers have limited sensitivity or are too computationally demanding for many users. Addi...
Article
Full-text available
Motivation: Genomic networks represent a complex map of molecular interactions which are descriptive of the biological processes occurring in living cells. Identifying the small over-represented circuitry patterns in these networks helps generate hypotheses about the functional basis of such complex processes. Network motif discovery is a systemat...
Article
Full-text available
Methods for assembly, taxonomic profiling and binning are key to interpreting metagenome data, but a lack of consensus about benchmarking complicates performance assessment. The Critical Assessment of Metagenome Interpretation (CAMI) challenge has engaged the global developer community to benchmark their programs on highly complex and realistic dat...
Preprint
Full-text available
Min hash is a probabilistic method for estimating the similarity of two sets in terms of their Jaccard index, defined as the ration of the size of their intersection to their union. We demonstrate that this method performs best when the sets under consideration are of similar size and the performance degrades considerably when the sets are of very...
Preprint
Full-text available
Genomic networks represent a complex map of molecular interactions which are descriptive of the biological processes occurring in living cells. Identifying the small over-represented circuitry patterns in these networks helps generate hypotheses about the functional basis of such complex processes. Network motif discovery is a systematic way of ach...
Preprint
Full-text available
In metagenome analysis, computational methods for assembly, taxonomic profiling and binning are key components facilitating downstream biological data interpretation. However, a lack of consensus about benchmarking datasets and evaluation metrics complicates proper performance assessment. The Critical Assessment of Metagenome Interpretation (CAMI)...
Presentation
Full-text available
IndeCut is the very first practical method for determining the degree of uniformity/independence of network motif discovery algorithm sampling.
Preprint
Full-text available
Both the weighted and unweighted Unifrac distances have been very successfully employed to assess if two communities differ, but do not give any information about how two communities differ. We take advantage of recent observations that the Unifrac metric is equivalent to the so-called earth mover’s distance (also known as the Kantorovich-Rubinstei...
Preprint
Full-text available
We consider the goal of predicting how complex networks respond to chronic (press) perturbations when characterizations of their network topology and interaction strengths are associated with uncertainty. Our primary result is the derivation of exact formulas for the expected number and probability of qualitatively incorrect predictions about a sys...