Sayan Mukherjee

Duke University | DU · Department of Statistical Science

About

358
Publications
40,460
Reads
56,245
Citations
Citations since 2017
116 Research Items
29,406 Citations
[Chart: citations by year, 2017 to 2023; axis range 0 to 6,000]
Publications (358)
Article
We formulate an ergodic theory for the (almost sure) limit $\mathcal{P}^{\text{co}}_{\tilde{E}}$ of a sequence $(\mathcal{P}^{\text{co}}_{E_n})$ of successive dynamic imprecise probability kinematics (DIPK, introduced in [10]) updates of a set $\mathcal{P}^{\text{co}}_{E_0}$ representing the initial beliefs of an agent. As a consequence, we formulate a strong law of large numbers.
Preprint
Full-text available
Ecological relationships between bacteria mediate the services that gut microbiomes provide to their hosts. Knowing the overall direction and strength of these relationships within hosts, and their generalizability across hosts, is essential to learn how microbial ecology scales up to affect microbiome assembly, dynamics, and host health. Here we g...
Preprint
Full-text available
In the context of a finite admixture model whose components and weights are unknown, if the number of identifiable components is a function of the amount of data collected, we use techniques from stochastic convex geometry to find the growth rate of its expected value. We also show that by placing a Dirichlet process prior on the densities supporte...
Article
Full-text available
Human gut microbial dynamics are highly individualized, making it challenging to link microbiota to health and to design universal microbiome therapies. This individuality is typically attributed to variation in host genetics, diets, environments and medications but it could also emerge from fundamental ecological forces that shape microbiota more...
Preprint
Full-text available
We state concentration and martingale inequalities for the output of the hidden layers of a stochastic deep neural network (SDNN), as well as for the output of the whole SDNN. These results allow us to introduce an expected classifier (EC), and to give a probabilistic upper bound for the classification error of the EC. We also state the optimal numbe...
Preprint
Full-text available
We formulate an ergodic theory for the (almost sure) limit of a sequence of successive dynamic imprecise probability kinematics (DIPK, introduced in Caprio and Gong, 2021) updates of a set representing the initial beliefs of an agent. As a consequence, we formulate a strong law of large numbers.
Preprint
Full-text available
We propose a new, more general definition of extended probability measures. We study their properties and provide a behavioral interpretation. We put them to use in an inference procedure, whose environment is canonically represented by the probability space $(\Omega,\mathcal{F},P)$, when both $P$ and the composition of $\Omega$ are unknown. We dev...
Article
Full-text available
Identifying structural differences among proteins can be a non-trivial task. When contrasting ensembles of protein structures obtained from molecular dynamics simulations, biologically-relevant features can be easily overshadowed by spurious fluctuations. Here, we present SINATRA Pro, a computational pipeline designed to robustly identify topologic...
Conference Paper
Full-text available
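We present two classes of abstract prearithmetics, $\{\mathbf{A}_M\}_{M \geq 1}$ and $\{\mathbf{B}_M\}_{M > 0}$. The first one is weakly projective with respect to the nonnegative real Diophantine arithmetic $\mathbf{R_+}=(\mathbb{R}_+,+,\times,\leq_{\mathbb{R}_+})$, and the second one is projective with respect to the extended real Diophantine arithmetic $\overline{\mathbf{R}}=(\overline{\mathbb{R}},+,\times,\leq_{\overline{\mathbb{R}}})$. In addition, we have that every $\mathbf{A}_M$ and every $\mathbf{B}_M$ is a comple...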
Preprint
Full-text available
A common statistical problem is inference from positive-valued multivariate measurements where the scale (e.g., sum) of the measurements is not representative of the scale (e.g., total size) of the system being studied. This situation is common in the analysis of modern sequencing data. The field of Compositional Data Analysis (CoDA) axiomatically...
Preprint
Full-text available
Human gut microbial dynamics are highly individualized, making it challenging to link microbiota to health and to design universal microbiome therapies. This individuality is typically attributed to variation in diets, environments, and medications, but it could also emerge from fundamental ecological forces that shape primate microbiota more gener...
Preprint
Topological transforms have been very useful in the statistical analysis of shapes or surfaces because they require neither that the shapes be diffeomorphic nor that correspondence maps be estimated. In this paper we introduce two topological transforms that generalize from shapes to fields, $f:\mathbf{R}^3 \rightarrow \mathbf{R}$. Both transforms t...
Article
We introduce a Bayesian model for inferring mixtures of subspaces of different dimensions. The model allows flexible and efficient learning of a density supported in an ambient space which in fact can concentrate around some lower-dimensional space. The key challenge in such a mixture model is specification of prior distributions over subspaces of...
Preprint
Full-text available
Identifying structural differences among proteins can be a non-trivial task. When contrasting ensembles of protein structures obtained from molecular dynamics simulations, biologically-relevant features can be easily overshadowed by spurious fluctuations. Here, we present SINATRA Pro, a computational pipeline designed to robustly identify topologic...
Article
Full-text available
PCR amplification plays an integral role in the measurement of mixed microbial communities via high-throughput DNA sequencing of the 16S ribosomal RNA (rRNA) gene. Yet PCR is also known to introduce multiple forms of bias in 16S rRNA studies. Here we present a paired modeling and experimental approach to characterize and mitigate PCR NPM-bias (PCR...
Article
Full-text available
In some mammals and many social insects, highly cooperative societies are characterized by reproductive division of labor, in which breeders and nonbreeders become behaviorally and morphologically distinct. While differences in behavior and growth between breeders and nonbreeders have been extensively described, little is known of their molecular u...
Preprint
Full-text available
We present three classes of abstract prearithmetics, $\{\mathbf{A}_M\}_{M \geq 1}$, $\{\mathbf{A}_{-M,M}\}_{M \geq 1}$, and $\{\mathbf{B}_M\}_{M > 0}$. The first one is weakly projective with respect to the nonnegative real Diophantine arithmetic $\mathbf{R_+}=(\mathbb{R}_+,+,\times,\leq_{\mathbb{R}_+})$, the second one is weakly projective with re...
Article
Full-text available
We develop a geometric framework that characterizes the synchronization problem --- the problem of consistently registering or aligning a collection of objects. The theory we formulate characterizes the cohomological nature of synchronization based on the classical theory of fibre bundles. We first establish the correspondence between synchronizati...
Article
Full-text available
Genomic studies feature multivariate count data from high-throughput DNA sequencing experiments, which often contain many zero values. These zeros can cause artifacts for statistical analyses and multiple modeling approaches have been developed in response. Here, we apply different zero-handling models to gene-expression and microbiome datasets and...
Preprint
Full-text available
In some mammals and many social insects, highly cooperative societies are characterized by reproductive division of labor, in which breeders and nonbreeders become behaviorally and morphologically distinct. While differences in behavior and growth between breeders and nonbreeders have been extensively described, little is known of their molecular u...
Article
Full-text available
Glioblastoma multiforme (GBM) is an aggressive form of human brain cancer that is under active study in the field of cancer biology. Its rapid progression and the relative time cost of obtaining molecular data make other readily-available forms of data, such as images, an important resource for actionable measures in patients. Our goal is to utiliz...
Preprint
Full-text available
It has been a longstanding challenge in geometric morphometrics and medical imaging to infer the physical locations (or regions) of 3D shapes that are most associated with a given response variable (e.g., class labels) without needing common predefined landmarks across the shapes, computing correspondence maps between the shapes, or requiring the sh...
Article
Full-text available
Human tumors have distinct profiles of genomic alterations, and each of these alterations has the potential to cause unique changes to cellular homeostasis. Detailed analyses of these changes could reveal downstream effects of genomic alterations, contributing to our understanding of their roles in tumor development and progression. Across a range...
Preprint
Full-text available
PCR amplification plays a central role in the measurement of mixed microbial communities via high-throughput sequencing. Yet PCR is also known to be a common source of bias in microbiome data. Here we present a paired modeling and experimental approach to characterize and mitigate PCR bias in microbiome studies. We use experimental data from mock b...
Article
Full-text available
Changes in gene regulation have long been thought to play an important role in primate evolution. However, although a number of studies have compared genome-wide gene expression patterns across primate species, fewer have investigated the gene regulatory mechanisms that underlie such patterns, or the relative contribution of drift versus selection....
Article
The problem of pattern and scale is a central challenge in ecology. In community ecology, an important scale is that at which we aggregate species to define our units of study, such as aggregation of “nitrogen fixing trees” to understand patterns in carbon sequestration. With the emergence of massive community ecological data sets, there is a need...
Preprint
In this paper we consider a Bayesian framework for making inferences about dynamical systems from ergodic observations. The proposed Bayesian procedure is based on the Gibbs posterior, a decision theoretic generalization of standard Bayesian inference. We place a prior over a model class consisting of a parametrized family of Gibbs measures on a mi...
Preprint
We develop a Gaussian-process mixture model for heterogeneous treatment effect estimation that leverages the use of transformed outcomes. The approach we will present attempts to improve point estimation and uncertainty quantification relative to past work that has used transformed variable related methods as well as traditional outcome modeling. E...
Article
Full-text available
Following publication of the original article [1], the authors noticed an error in the presentation of equations in the PDF version.
Preprint
Full-text available
Due to the advent and utility of high-throughput sequencing, modern biomedical research abounds with multivariate count data. Yet such sequence count data is often extremely sparse; that is, much of the data is zero values. Such zero values are well known to cause problems for statistical analyses. In this work we provide a systematic description o...
Preprint
Full-text available
The problem of dimension reduction is of increasing importance in modern data analysis. In this paper, we consider modeling the collection of points in a high dimensional space as a union of low dimensional subspaces. In particular we propose a highly scalable sampling based algorithm that clusters the entire data via first spectral clustering of a...
Article
Full-text available
Background: Artificial gut models provide unique opportunities to study human-associated microbiota. Outstanding questions for these models' fundamental biology include the timescales on which microbiota vary and the factors that drive such change. Answering these questions though requires overcoming analytical obstacles like estimating the effect...
Preprint
Full-text available
Longitudinal studies of microbial communities have emphasized that host-associated microbiota are highly dynamic as well as underscoring the potential biomedical relevance of understanding these dynamics. Despite this increasing appreciation, statistical challenges in the design and analysis of longitudinal microbiome studies such as sequence count...
Preprint
Full-text available
Changes in gene regulation have long been thought to play an important role in primate evolution. However, although a number of studies have compared genome-wide gene expression patterns across primate species, fewer have investigated the gene regulatory mechanisms that underlie such patterns, or the relative contribution of drift versus selection....
Conference Paper
Linear mixed models (LMMs) are used extensively to model observations that are not independent. Parameter estimation for LMMs can be computationally prohibitive on big data. State-of-the-art learning algorithms require computational complexity which depends at least linearly on the dimension $p$ of the covariates, and often use heuristics that do n...
Preprint
Full-text available
Artificial gut models provide unique opportunities to study human-associated microbiota. Outstanding questions for these models’ fundamental biology include the timescales on which microbiota vary and the factors that drive such change. Answering these questions though requires overcoming analytical obstacles like estimating the effects of technica...
Article
Linear mixed models (LMMs) are used to model dependencies among observations in linear regression and are used extensively in many application areas. Parameter estimation for LMMs can be computationally prohibitive on big data. State-of-the-art learning algorithms require computational complexity which depends at least linearly on the dimen...
Preprint
We propose a representation of Gaussian processes (GPs) based on powers of the integral operator defined by a kernel function; we call these stochastic processes integral Gaussian processes (IGPs). Sample paths from IGPs are functions contained within the reproducing kernel Hilbert space (RKHS) defined by the kernel function; in contrast, sample pat...
Preprint
Full-text available
The problem of pattern and scale is a central challenge in ecology. The problem of scale is central to community ecology, where functional ecological groups are aggregated and treated as a unit underlying an ecological pattern, such as aggregation of “nitrogen fixing trees” into a total abundance of a trait underlying ecosystem physiology. With the...
Article
Full-text available
Diverse pathways drive resistance to BRAF/MEK inhibitors in BRAF-mutant melanoma, suggesting that durable control of resistance will be a challenge. By combining statistical modeling of genomic data from matched pre-treatment and post-relapse patient tumors with functional interrogation of >20 in vitro and in vivo resistance models, we discovered...
Article
Automated geometric morphometric methods are promising tools for shape analysis in comparative biology, improving researchers' abilities to quantify variation extensively (by permitting more specimens to be analyzed) and intensively (by characterizing shapes with greater fidelity). Although use of these methods has increased, published automated me...
Article
Nonlinear kernel regression models are often used in statistics and machine learning because they are more accurate than linear models. Variable selection for kernel regression models is a challenge partly because, unlike the linear regression setting, there is no clear concept of an effect size for regression coefficients. In this paper, we propos...
Conference Paper
We present an efficient algorithm for learning mixed membership models when the number of variables $p$ is much larger than the number of hidden components $k$. This algorithm reduces the computational complexity of state-of-the-art tensor methods, which require decomposing an $O\left(p^3\right)$ tensor, to factorizing $O\left(p/k\right)$ sub-tenso...
Article
Full-text available
Epistasis, commonly defined as the interaction between multiple genes, is an important genetic component underlying phenotypic variation. Many statistical methods have been developed to model and identify epistatic interactions between genetic variants. However, because of the large combinatorial search space of interactions, most epistasis mapping...
Data
Power to detect pairwise epistatic heritability across all simulation scenarios. Compared here is the power of the standard variance component model to estimate the true non-zero pairwise epistatic PVE at the significance level of α = 0.05 under a standard asymptotic normal test. Each simulation scenario is represented by a different color, with ea...
Data
Empirical power of exhaustive search procedures to detect epistatic pairs. Here, the effectiveness of MAPIT (green) as an initial step in a pairwise detection filtration process is compared against the more conventional single-SNP testing procedure, which is carried out via GEMMA (purple). In both cases, the search for epistatic pairs occurs betwee...
Data
Enrichment of mepiQTL SNPs in GEUVADIS data set after using MAPIT with a genome-wide relatedness matrix. Shown here is the distribution of locations for significant SNPs, relative to the 5′ most gene transcription start site (TSS) and the 3′ most gene transcription end site (TES). (A) displays the marginally epistatic QTL (mepiQTL) detected by MAP...
Data
Chromosome-wide scans for epistatic effects in GEUVADIS data set. Depicted are the −log10(P) transformed MAPIT p-values of quality-control-positive cis-SNPs plotted against their genomic position in chromosomes (A) 2, (B) 3, and (C) 4, respectively. Note that MAPIT was implemented with Kcis. Here, the epistatic associated genes are labeled (blue)....
Data
Chromosome-wide scans for epistatic effects in GEUVADIS data set. Depicted are the −log10(P) transformed MAPIT p-values of quality-control-positive cis-SNPs plotted against their genomic position in chromosomes (A) 9, (B) 10, and (C) 11, respectively. Note that MAPIT was implemented with Kcis. Here, the epistatic associated genes are labeled (blue)...
Data
Chromosome-wide scans for epistatic effects in GEUVADIS data set. Depicted are the −log10(P) transformed MAPIT p-values of quality-control-positive cis-SNPs plotted against their genomic position in chromosomes (A) 15, (B) 16, and (C) 17, respectively. Note that MAPIT was implemented with Kcis. Here, the epistatic associated genes are labeled (blue...
Data
Estimates of the proportion of phenotypic variance explained (PVE) by additive and pairwise epistatic effects for each gene analyzed in the GEUVADIS data set. Estimates of the pPVE on the y-axis were calculated by using variance component models, where the components represent additive effects (grey) and pairwise epistasis (green). More...
Data
Percentage of overlap (i.e. coverage) between the mepiGenes detected by MAPIT, the eGenes identified by GEMMA, and the epiGenes found by PLINK. Coverage was computed as the proportion of significant genes detected by row j that were also identified by column k. (XLSX)
Data
The marginal epistatic p-values for all significant mepiQTL as computed by MAPIT in the GEUVADIS data set using the cis-gene specific genetic relatedness matrix Kcis. Strong significance of association for a particular SNP or locus was determined by using a gene specific Bonferroni-corrected significance p-value threshold P = 0.05/∑i si, where si i...
Data
The marginal epistatic p-values for all significant mepiQTL as computed by MAPIT in the GEUVADIS data set using the genome-wide specific genetic relatedness matrix KPop. Strong significance of association for a particular SNP or locus was determined by using a gene specific Bonferroni-corrected significance p-value threshold P = 0.05/∑i si, where s...
Data
Power analysis for detecting group 1 and group 2 causal SNPs in the presence of population stratification effects (Top 5 PCs). We compare the mapping abilities of MAPIT (solid line) to the exhaustive search procedure in PLINK (dotted line) in scenarios I (A), II (B), III (C), and IV (D), under broad-sense heritability level H2 = 0.6 and ρ = 0.8. He...
Data
Power analysis for detecting group 1 and group 2 causal SNPs in the presence of population stratification effects (Top 10 PCs). We compare the mapping abilities of MAPIT (solid line) to the exhaustive search procedure in PLINK (dotted line) in scenarios I (A), II (B), III (C), and IV (D), under broad-sense heritability level H2 = 0.6 and ρ = 0.5. H...
Data
Power analysis for detecting group 1 and group 2 causal SNPs in the presence of population stratification effects (Top 10 PCs). We compare the mapping abilities of MAPIT (solid line) to the exhaustive search procedure in PLINK (dotted line) in scenarios I (A), II (B), III (C), and IV (D), under broad-sense heritability level H2 = 0.6 and ρ = 0.8. H...
Data
Chromosome-wide scans for epistatic effects in GEUVADIS data set. Depicted are the −log10(P) transformed MAPIT p-values of quality-control-positive cis-SNPs plotted against their genomic position in chromosomes (A) 5, (B) 7, and (C) 8, respectively. Note that MAPIT was implemented with Kcis. Here, the epistatic associated genes are labeled (blue)....
Data
Empirical type I error estimates of MAPIT in the presence of population stratification effects (Top 5 PCs). Each entry represents type I error rate estimates as the proportion of p-values below α under the null hypothesis based on 100 simulated continuous phenotypes for the normal test (or z-test) and the Davies method. These results are based on 100 sim...
Data
Empirical type I error estimates of MAPIT in the presence of population stratification effects (Top 10 PCs). Each entry represents type I error rate estimates as the proportion of p-values below α under the null hypothesis based on 100 simulated continuous phenotypes for the normal test (or z-test) and the Davies method. These results are based on 100 si...
Data
The marginal epistatic p-values for all significant mepiQTL as computed by MAPIT in the GEUVADIS data set using the cis-gene specific genetic relatedness matrix Ktrans. Strong significance of association for a particular SNP or locus was determined by using a gene specific Bonferroni-corrected significance p-value threshold P = 0.05/∑i si, where si...
Data
Empirical power to detect simulated causal interacting markers. (A) and (B) show the power of MAPIT to identify SNPs in each causal group under the Bonferroni-corrected genome-wide significance level α = 8.3 × 10⁻⁶. Groups 1 and 2 causal markers are colored in light red and light blue, respectively. These figures are based on a broad-sense heritabil...
Data
Accuracy of total pairwise epistatic PVE estimates across all simulation scenarios. Compared here are the epistatic PVE estimates computed by the standard variance component model. Each simulation scenario is represented by a different color, with each of the three simulation schemes being labeled on the x-axis. These figures are based on 100 simul...
Data
Comparison of epistatic filtration methods with MAPIT and GEMMA on the GEUVADIS data set. (A)-(C) show histograms of the MAPIT p-values for all variants in the GEUVADIS data set using the genome-wide genetic relatedness matrix Ktrans (A), KGW (B), and KPop (C), respectively. The horizontal red line corresponds to a uniform distribution of p-value...