About
86 Publications
15,145 Reads
2,150 Citations
Introduction
Additional affiliations
August 2013 - present
January 2012 - July 2012
August 2012 - August 2013
Publications (86)
Motivation
Functional profiling of metagenomic samples is essential to decipher the functional capabilities of microbial communities. Traditional and more widely used functional profilers in the context of metagenomics rely on aligning reads against a known reference database. However, aligning sequencing reads against a large and fast-growing data...
Distance-guided tree construction with unknown tree topology and branch lengths has been a long-studied problem. In contrast, distance-guided branch length assignment with a fixed tree topology has not yet been systematically investigated, despite having significant applications. In this paper, we provide a formal mathematical formulation of this pr...
The increasing number and volume of genomic and metagenomic data necessitate scalable and robust computational models for precise analysis. Sketching techniques utilizing k-mers from a biological sample have proven to be useful for large-scale analyses. In recent years, FracMinHash has emerged as a popular sketching technique and has been used in...
Motivation: The sheer volume and variety of genomic content within microbial communities makes metagenomics a field rich in biomedical knowledge. To traverse these complex communities and their vast unknowns, metagenomic studies often depend on distinct reference databases, such as the Genome Taxonomy Database (GTDB), the Kyoto Encyclopedia of Gene...
Motivation
In metagenomics, the study of environmentally associated microbial communities from their sampled DNA, one of the most fundamental computational tasks is that of determining which genomes from a reference database are present or absent in a given sample metagenome. Existing tools generally return point estimates, with no associated confi...
Idiopathic multicentric Castleman disease (iMCD) is a rare, immunological illness with an unknown etiology. It is characterized by multiple enlarged lymph nodes with unique histopathological features and systemic inflammation resulting from cytokine release syndrome, which can rapidly develop into multiple organ failure and death. Inhibition of the...
Sketching methods provide scalable solutions for analyzing rapidly growing genomic data. A recent innovation in sketching methods, syncmers, has proven effective and has been employed for read alignment. Syncmers share fundamental features with the FracMinHash technique, a recent modification of the popular MinHash algorithm for set similarity esti...
Motivation
Functional profiling of metagenomic samples is essential to decipher the functional capabilities of these microbial communities. Traditional and more widely used functional profilers in the context of metagenomics rely on aligning reads against a known reference database. However, aligning sequencing reads against a large and fast-growin...
Knowledge graphs have become a common approach for knowledge representation. Yet, the application of graph methodology is elusive due to the sheer number and complexity of knowledge sources. In addition, semantic incompatibilities hinder efforts to harmonize and integrate across these diverse sources. As part of The Biomedical Translator Consortium...
Motivation:
Metagenomic samples have high spatiotemporal variability. Hence, it is useful to summarize and characterize the microbial makeup of a given environment in a way that is biologically reasonable and interpretable. The UniFrac metric has been a robust and widely used metric for measuring the variability between metagenomic samples. We pro...
Sketching methods offer computational biologists scalable techniques to analyze data sets that continue to grow in size. MinHash is one such technique to estimate set similarity that has enjoyed recent broad application. However, traditional MinHash has previously been shown to perform poorly when applied to sets of very dissimilar sizes. FracMinHa...
In metagenomics, the study of environmentally associated microbial communities from their sampled DNA, one of the most fundamental computational tasks is that of determining which genomes from a reference database are present or absent in a given sample metagenome.
While tools exist to answer this question, all existing approaches to date return po...
Motivation:
With the rapidly growing volume of knowledge and data in biomedical databases, improved methods for knowledge-graph-based computational reasoning are needed in order to answer translational questions. Previous efforts to solve such challenging computational reasoning problems have contributed tools and approaches, but progress has been...
Metagenomic samples have high spatiotemporal variability. Hence, it is useful to summarize and characterize the microbial makeup of a given environment in a way that is biologically reasonable and interpretable. The UniFrac metric has been a robust and widely-used metric for measuring the variability between metagenomic samples. We propose that the...
Background:
Metagenomic taxonomic profiling aims to predict the identity and relative abundance of taxa in a given whole-genome sequencing metagenomic sample. A recent surge in computational methods that aim to accurately estimate taxonomic profiles, called taxonomic profilers, has motivated community-driven efforts to create standardized benchmar...
Background:
Computational drug repurposing is a cost- and time-efficient approach that aims to identify new therapeutic targets or diseases (indications) of existing drugs/compounds. It is especially critical for emerging and/or orphan diseases due to its cheaper investment and shorter research cycle compared with traditional wet-lab drug discover...
Background
Computational drug repurposing is a cost- and time-efficient approach that aims to identify new therapeutic targets or diseases (indications) of existing drugs/compounds. It is especially critical for emerging and/or orphan diseases due to its cheaper investment and shorter research cycle compared with traditional wet-lab drug discovery...
Background
Biomedical translational science is increasingly using computational reasoning on repositories of structured knowledge (such as UMLS, SemMedDB, ChEMBL, Reactome, DrugBank, and SMPDB) in order to facilitate discovery of new therapeutic targets and modalities. The NCATS Biomedical Data Translator project is working to federate autonomous re...
Motivation: With the rapidly growing volume of knowledge and data in biomedical databases, improved methods for knowledge-graph-based computational reasoning are needed in order to answer translational questions. Previous efforts to solve such challenging computational reasoning problems have contributed tools and approaches, but progress has been...
Motivation:
K-mer-based methods are used ubiquitously in the field of computational biology. However, determining the optimal value of k for a specific application often remains heuristic. Simply reconstructing a new k-mer set with another k-mer size is computationally expensive, especially in metagenomic analysis where datasets are large. Here, w...
Motivation:
Sketching is now widely used in bioinformatics to reduce data size and increase data processing speed. Sketching approaches entice with improved scalability but also carry the danger of decreased accuracy and added bias. In this article, we investigate the minimizer sketch and its use to estimate the Jaccard similarity between two sequ...
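For orientation, a minimal Python sketch of the minimizer idea referenced above, paired with a naive Jaccard estimate computed from two such sketches. This is an illustrative sketch under assumed parameters, not the implementation studied in this work; the function names, the hash choice, and the defaults k = 21 and w = 10 are all illustrative assumptions.

```python
# Minimal illustration (not the authors' implementation) of a minimizer sketch
# and a naive Jaccard estimate computed from two such sketches.
import hashlib

def khash(kmer: str) -> int:
    # Stable, well-mixed hash of a k-mer; any such hash would do here.
    return int.from_bytes(hashlib.blake2b(kmer.encode(), digest_size=8).digest(), "big")

def minimizer_sketch(seq: str, k: int = 21, w: int = 10) -> set:
    """Collect the minimizer (lowest-hash k-mer) of every window of w consecutive k-mers."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    sketch = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        sketch.add(min(window, key=khash))
    return sketch

def jaccard_from_sketches(a: set, b: set) -> float:
    # Plug-in estimate of |A ∩ B| / |A ∪ B| computed on the sketches themselves.
    return len(a & b) / len(a | b) if a | b else 0.0
```

As the abstract notes, such a plug-in estimate can be biased relative to the true Jaccard similarity of the full k-mer sets; the snippet above only makes the objects under study concrete.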
Metagenomic taxonomic profiling aims to predict the identity and relative abundance of taxa in a given whole genome sequencing metagenomic sample. A recent surge in computational methods that aim to accurately estimate taxonomic profiles, called taxonomic profilers, has motivated community-driven efforts to create standardized benchmarking dataset...
Evaluating metagenomic software is key for optimizing metagenome interpretation and a focus of the Initiative for the Critical Assessment of Metagenome Interpretation (CAMI). The CAMI II challenge engaged the community to assess methods on realistic and complex datasets with long- and short-read sequences, created computationally from around 1,700 ne...
While the use of networks to understand how complex systems respond to perturbations is pervasive across scientific disciplines, the uncertainty associated with estimates of pairwise interaction strengths (edge weights) remains rarely considered. Mischaracterizations of interaction strength can lead to qualitatively incorrect predictions regarding...
k-mer-based methods are widely used in bioinformatics, but there are many gaps in our understanding of their statistical properties. Here, we consider the simple model where a sequence S (e.g., a genome or a read) undergoes a simple mutation process through which each nucleotide is mutated independently with some probability r, under the assumption...
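One standard consequence of the mutation model just described, stated here for orientation (the snippet itself is truncated): a k-mer escapes mutation only if all k of its positions do, which gives the expected fraction of mutated k-mers in closed form.

```latex
% Each position mutates independently with probability r, so a fixed k-mer
% is unmutated only when all k of its positions are unmutated.
\[
  \Pr[\text{a given } k\text{-mer is unmutated}] = (1-r)^{k},
  \qquad
  q = \mathbb{E}\big[\text{fraction of mutated } k\text{-mers}\big] = 1 - (1-r)^{k}.
\]
```

For example, with r = 0.01 and k = 21, q ≈ 1 − 0.99²¹ ≈ 0.19, i.e., roughly a fifth of the k-mers are expected to be affected even at 1% nucleotide divergence.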
Motivation
Sketching is now widely used in bioinformatics to reduce data size and increase data processing speed. Sketching approaches entice with improved scalability but also carry the danger of decreased accuracy and added bias. In this paper, we investigate the minimizer sketch and its use to estimate the Jaccard similarity between two sequence...
The identification of reference genomes and taxonomic labels from metagenome data underlies many microbiome studies. Here we describe two algorithms for compositional analysis of metagenome sequencing data. We first investigate the FracMinHash sketching technique, a derivative of modulo hash that supports Jaccard containment estimation between sets...
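To make the FracMinHash idea mentioned above concrete, here is a minimal Python sketch of hash-threshold subsampling and a containment estimate computed from the retained hashes. It is a simplified illustration under assumptions, not the code from this work; the function names and the scale value 0.01 are illustrative.

```python
# Minimal FracMinHash illustration: keep every k-mer whose hash falls in the
# lowest `scale` fraction of the hash space, then estimate the containment of
# A in B from the retained hashes alone.
import hashlib

HASH_BITS = 64
H_MAX = 2 ** HASH_BITS - 1

def h(kmer: str) -> int:
    return int.from_bytes(hashlib.blake2b(kmer.encode(), digest_size=8).digest(), "big")

def fracminhash(kmers, scale: float = 0.01) -> set:
    """Retain the hashes landing below a cutoff equal to `scale` times the hash range."""
    cutoff = scale * H_MAX
    return {h(x) for x in set(kmers) if h(x) <= cutoff}

def containment(sketch_a: set, sketch_b: set) -> float:
    """Estimate |A ∩ B| / |A| from the two FracMinHash sketches."""
    return len(sketch_a & sketch_b) / len(sketch_a) if sketch_a else 0.0
```

Because the cutoff is a fixed fraction of the hash range rather than a fixed sketch size, two sets of very different cardinalities are subsampled at the same rate, which is what makes containment estimation between sets of different sizes workable.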
Sketching methods offer computational biologists scalable techniques to analyze data sets that continue to grow in size. MinHash is one such technique that has enjoyed recent broad application. However, traditional MinHash has previously been shown to perform poorly when applied to sets of very dissimilar sizes. FracMinHash was recently introduced...
K-mer based methods are used ubiquitously in the field of computational biology. However, determining the optimal value of k for a specific application often remains heuristic. Simply reconstructing a new k-mer set with another k-mer size is computationally expensive, especially in metagenomic analysis where data sets are large. Here, we introduce...
Background
Biomedical translational science is increasingly leveraging computational reasoning on large repositories of structured knowledge (such as the Unified Medical Language System (UMLS), the Semantic Medline Database (SemMedDB), ChEMBL, DrugBank, and the Small Molecule Pathway Database (SMPDB)) and data in order to facilitate discovery of ne...
Aligning sequencing reads onto a reference is an essential step of the majority of genomic analysis pipelines. Computational algorithms for read alignment have evolved in accordance with technological advances, leading to today’s diverse array of alignment methods. We provide a systematic survey of algorithmic foundations and methodologies across 1...
Evaluating metagenomic software is key for optimizing metagenome interpretation and focus of the community-driven initiative for the Critical Assessment of Metagenome Interpretation (CAMI). In its second challenge, CAMI engaged the community to assess their methods on realistic and complex metagenomic datasets with long and short reads, created fro...
‘Knowledge graphs’ (KGs) have become a common approach for representing biomedical knowledge. In a KG, multiple biomedical datasets can be linked together as a graph representation, with nodes representing entities such as ‘chemical substance’ or ‘genes’ and edges representing predicates such as ‘causes’ or ‘treats’. Reasoning and inference algorit...
Computational methods are key in microbiome research, and obtaining a quantitative and unbiased performance estimate is important for method developers and applied researchers. For meaningful comparisons between methods, to identify best practices and common use cases, and to reduce overhead in benchmarking, it is necessary to have standardized dat...
K-mer-based methods are widely used in bioinformatics, but there are many gaps in our understanding of their statistical properties. Here, we consider the simple model where a sequence S (e.g. a genome or a read) undergoes a simple mutation process whereby each nucleotide is mutated independently with some probability r, under the assumption that t...
Abstract
Metagenomics studies leverage genomic reference databases to generate discoveries in basic science and translational research. However, current microbial studies use disparate reference databases that lack consistent standards of specimen inclusion, data preparation, taxon labelling and accessibility, hindering their quality and comprehens...
Metagenomic profiling, predicting the presence and relative abundances of microbes in a sample, is a critical first step in microbiome analysis. Alignment-based approaches are often considered accurate yet computationally infeasible. Here, we present a novel method, Metalign, that performs efficient and accurate alignment-based metagenomic profilin...
Computational methods are key in microbiome research, and obtaining a quantitative and unbiased performance estimate is important for method developers and applied researchers. For meaningful comparisons between methods, to identify best practices and common use cases, and to reduce overhead in benchmarking, it is necessary to have standardized data s...
An amendment to this paper has been published and can be accessed via the original article.
Massively parallel sequencing techniques have revolutionized biological and medical sciences by providing unprecedented insight into the genomes of humans, animals, and microbes. Aligned reads are essential for answering important biological questions, such as detecting mutations driving various human diseases and complex traits as well as identify...
When analyzing communities of microorganisms from their sequenced DNA, an important task is taxonomic profiling: enumerating the presence and relative abundance of all organisms, or merely of all taxa, contained in the sample. This task can be tackled via compressive-sensing-based approaches, which favor communities featuring the fewest organisms a...
Whole-genome shotgun sequencing enables the analysis of microbial communities in unprecedented detail, with major implications in medicine and ecology. Predicting the presence and relative abundances of microbes in a sample, known as "metagenomic profiling", is a critical first step in microbiome analysis. Existing profiling methods have been shown...
During the past decade, rapid advancements in sequencing technologies have enabled the study of human-associated microbiomes at an unprecedented scale. Metagenomics, specifically, is emerging as a clinical tool to identify the agents of infection, track the spread of diseases, and surveil potential pathogens. Yet, despite advances in high-throughpu...
Abstract
Computational drug repurposing, also called drug repositioning, is a low cost, promising tool for finding new uses for existing drugs. With the continued growth of repositories of biomedical data and knowledge, increasingly varied kinds of information are available to train machine learning approaches to drug repurposing. However, existin...
The past decade has witnessed a revolution in microbiology through the advent of microbiomics: investigating entire microbial ecosystems in and around us as opposed to studying single microbes in isolation. Recent technological advances in high-throughput sequencing have enabled these studies, significantly improving our capacity to explore...
MinHash is a probabilistic method for estimating the similarity of two sets in terms of their Jaccard index, defined as the ratio of the size of their intersection to their union. We demonstrate that this method performs best when the sets under consideration are of similar size and the performance degrades considerably when the sets are of very di...
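To make the estimator named above concrete, here is a minimal Python sketch of classical MinHash with independent seeded hash functions. It is an illustrative sketch, not any particular tool's implementation; the function names and the choice of 128 hash functions are assumptions made for the example.

```python
# Minimal MinHash illustration: with m independent hash functions, the probability
# that two sets share the same minimum hash under one function equals their Jaccard
# index, so the fraction of agreements across functions estimates it.
import hashlib

def seeded_hash(item: str, seed: int) -> int:
    data = f"{seed}:{item}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def minhash_signature(items: set, num_hashes: int = 128) -> list:
    # One minimum per seeded hash function.
    return [min(seeded_hash(x, seed) for x in items) for seed in range(num_hashes)]

def estimate_jaccard(sig_a: list, sig_b: list) -> float:
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)
```

The size-mismatch problem discussed in the abstract shows up here as variance: when one set is much larger than the other, the true Jaccard index is tiny and a fixed number of hash functions yields few or no agreements, so the estimate becomes unreliable.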
Background:
High throughput sequencing has spurred the development of metagenomics, which involves the direct analysis of microbial communities in various environments such as soil, ocean water, and the human body. Many existing methods based on marker genes or k-mers have limited sensitivity or are too computationally demanding for many users. Ad...
The explosive growth in taxonomic metagenome profiling methods over the past years has created a need for systematic comparisons using relevant performance criteria. The Open-community Profiling Assessment tooL (OPAL) implements commonly used performance metrics, including those of the first challenge of the initiative for the Critical Assessment o...
Reference genomes are essential for metagenomics studies, which require comparing short metagenomic reads with available reference genomes to identify organisms within a sample. We analyzed the current state of fungal reference databases to assess their usability as reference databases for metagenomic studies. The overlap of genera and species in t...
Both the weighted and unweighted Unifrac distances have been very successfully employed to assess if two communities differ, but do not give any information about how two communities differ. We take advantage of recent observations that the Unifrac metric is equivalent to the so-called earth mover's distance (also known as the Kantorovich-Rubinstei...
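As a rough illustration of the earth mover's distance view mentioned above, the following Python sketch computes the EMD between two relative-abundance vectors placed on the leaves of a rooted tree using the usual tree closed form (sum over every branch of its length times the absolute difference in total mass below it). This is a sketch under that assumption, not the paper's method or code; the function name and data layout are invented for the example.

```python
# Minimal tree earth mover's distance: for each branch, charge its length times the
# absolute difference in the total mass lying below it in the two samples.
def tree_emd(parent, length, mass_a, mass_b):
    """parent: child node -> parent node; length: child node -> branch length to its parent;
    mass_a / mass_b: node -> relative abundance (leaves carry mass, internal nodes default to 0)."""
    below_a, below_b = dict(mass_a), dict(mass_b)

    def depth(n):
        # Number of edges from n up to the root.
        d = 0
        while n in parent:
            n, d = parent[n], d + 1
        return d

    distance = 0.0
    # Process deepest nodes first so each node's subtree mass is complete when its branch is charged.
    for node in sorted(parent, key=depth, reverse=True):
        distance += length[node] * abs(below_a.get(node, 0.0) - below_b.get(node, 0.0))
        p = parent[node]
        below_a[p] = below_a.get(p, 0.0) + below_a.get(node, 0.0)
        below_b[p] = below_b.get(p, 0.0) + below_b.get(node, 0.0)
    return distance
```

For instance, with all of sample A's mass on one leaf and all of sample B's mass on another, the value reduces to the branch-length path between the two leaves; the closed form avoids solving a general transportation problem.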
Taxonomic metagenome profilers predict the presence and relative abundance of microorganisms from shotgun sequence samples of DNA isolated directly from a microbial community. Over the past years, there has been an explosive growth of software and algorithms for this task, resulting in a need for more systematic comparisons of these methods based o...
The role of the human microbiome in health and disease is increasingly appreciated. We studied the composition of microbial communities present in blood across 192 individuals, including healthy controls and patients with three disorders affecting the brain: schizophrenia, amyotrophic lateral sclerosis, and bipolar disorder. By using high-quality u...
The role of the human microbiome in health and disease is increasingly appreciated. We studied the composition of microbial communities present in blood across 192 individuals, including healthy controls and patients with three disorders affecting the brain: schizophrenia, amyotrophic lateral sclerosis and bipolar disorder. By using high quality un...
We consider the goal of predicting how complex networks respond to chronic (press) perturbations when characterizations of their network topology and interaction strengths are associated with uncertainty. Our primary result is the derivation of exact formulas for the expected number and probability of qualitatively incorrect predictions about a sys...
Background
High throughput sequencing has spurred the development of metagenomics, which involves the direct analysis of microbial communities in various environments such as soil, ocean water, and the human body. Many existing methods based on marker genes or k-mers have limited sensitivity or are too computationally demanding for many users. Addi...
Motivation:
Genomic networks represent a complex map of molecular interactions which are descriptive of the biological processes occurring in living cells. Identifying the small over-represented circuitry patterns in these networks helps generate hypotheses about the functional basis of such complex processes. Network motif discovery is a systemat...
Methods for assembly, taxonomic profiling and binning are key to interpreting metagenome data, but a lack of consensus about benchmarking complicates performance assessment. The Critical Assessment of Metagenome Interpretation (CAMI) challenge has engaged the global developer community to benchmark their programs on highly complex and realistic dat...
MinHash is a probabilistic method for estimating the similarity of two sets in terms of their Jaccard index, defined as the ratio of the size of their intersection to their union. We demonstrate that this method performs best when the sets under consideration are of similar size and the performance degrades considerably when the sets are of very...
Genomic networks represent a complex map of molecular interactions which are descriptive of the biological processes occurring in living cells. Identifying the small over-represented circuitry patterns in these networks helps generate hypotheses about the functional basis of such complex processes. Network motif discovery is a systematic way of ach...
In metagenome analysis, computational methods for assembly, taxonomic profiling and binning are key components facilitating downstream biological data interpretation. However, a lack of consensus about benchmarking datasets and evaluation metrics complicates proper performance assessment. The Critical Assessment of Metagenome Interpretation (CAMI)...
IndeCut is the first practical method for determining the degree of uniformity and independence of sampling in network motif discovery algorithms.
Both the weighted and unweighted Unifrac distances have been very successfully employed to assess if two communities differ, but do not give any information about how two communities differ. We take advantage of recent observations that the Unifrac metric is equivalent to the so-called earth mover’s distance (also known as the Kantorovich-Rubinstei...
We consider the goal of predicting how complex networks respond to chronic (press) perturbations when characterizations of their network topology and interaction strengths are associated with uncertainty. Our primary result is the derivation of exact formulas for the expected number and probability of qualitatively incorrect predictions about a sys...