# Sayan Mukherjee, Duke University | DU · Department of Statistical Science

## About

Publications: 358

Reads: 40,460

Citations: 56,245

## Publications

We formulate an ergodic theory for the (almost sure) limit $P^{co}_{\tilde{E}}$ of a sequence $(P^{co}_{E_n})$ of successive dynamic imprecise probability kinematics (DIPK, introduced in [10]) updates of a set $P^{co}_{E_0}$ representing the initial beliefs of an agent. As a consequence, we formulate a strong law of large numbers.

Ecological relationships between bacteria mediate the services that gut microbiomes provide to their hosts. Knowing the overall direction and strength of these relationships within hosts, and their generalizability across hosts, is essential to learn how microbial ecology scales up to affect microbiome assembly, dynamics, and host health. Here we g...

In the context of a finite admixture model whose components and weights are unknown, if the number of identifiable components is a function of the amount of data collected, we use techniques from stochastic convex geometry to find the growth rate of its expected value. We also show that by placing a Dirichlet process prior on the densities supporte...

Human gut microbial dynamics are highly individualized, making it challenging to link microbiota to health and to design universal microbiome therapies. This individuality is typically attributed to variation in host genetics, diets, environments and medications but it could also emerge from fundamental ecological forces that shape microbiota more...

We state concentration and martingale inequalities for the output of the hidden layers of a stochastic deep neural network (SDNN), as well as for the output of the whole SDNN. These results allow us to introduce an expected classifier (EC), and to give probabilistic upper bound for the classification error of the EC. We also state the optimal numbe...

We formulate an ergodic theory for the (almost sure) limit of a sequence of successive dynamic imprecise probability kinematics (DIPK, introduced in Caprio and Gong, 2021) updates of a set representing the initial beliefs of an agent. As a consequence, we formulate a strong law of large numbers.

We propose a new, more general definition of extended probability measures. We study their properties and provide a behavioral interpretation. We put them to use in an inference procedure, whose environment is canonically represented by the probability space $(\Omega,\mathcal{F},P)$, when both $P$ and the composition of $\Omega$ are unknown. We dev...

Identifying structural differences among proteins can be a non-trivial task. When contrasting ensembles of protein structures obtained from molecular dynamics simulations, biologically-relevant features can be easily overshadowed by spurious fluctuations. Here, we present SINATRA Pro, a computational pipeline designed to robustly identify topologic...

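We present two classes of abstract prearithmetics, $\{\mathbf{A}_M\}_{M \geq 1}$ and $\{\mathbf{B}_M\}_{M > 0}$. The first one is weakly projective with respect to the nonnegative real Diophantine arithmetic $\mathbf{R_+}=(\mathbb{R}_+,+,\times,\leq_{\mathbb{R}_+})$, and the second one is projective with respect to the extended real Diophantine arithmetic $\overline{\mathbf{R}}=(\overline{\mathbb{R}},+,\times,\leq_{\overline{\mathbb{R}}})$. In addition, we have that every $\mathbf{A}_M$ and every $\mathbf{B}_M$ is a comple...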

A common statistical problem is inference from positive-valued multivariate measurements where the scale (e.g., sum) of the measurements is not representative of the scale (e.g., total size) of the system being studied. This situation is common in the analysis of modern sequencing data. The field of Compositional Data Analysis (CoDA) axiomatically...

Human gut microbial dynamics are highly individualized, making it challenging to link microbiota to health and to design universal microbiome therapies. This individuality is typically attributed to variation in diets, environments, and medications, but it could also emerge from fundamental ecological forces that shape primate microbiota more gener...

Topological transforms have been very useful in statistical analysis of shapes or surfaces without restrictions that the shapes are diffeomorphic and requiring the estimation of correspondence maps. In this paper we introduce two topological transforms that generalize from shapes to fields, $f:\mathbf{R}^3 \rightarrow \mathbf{R}$. Both transforms t...

We introduce a Bayesian model for inferring mixtures of subspaces of different dimensions. The model allows flexible and efficient learning of a density supported in an ambient space which in fact can concentrate around some lower-dimensional space. The key challenge in such a mixture model is specification of prior distributions over subspaces of...

PCR amplification plays an integral role in the measurement of mixed microbial communities via high-throughput DNA sequencing of the 16S ribosomal RNA (rRNA) gene. Yet PCR is also known to introduce multiple forms of bias in 16S rRNA studies. Here we present a paired modeling and experimental approach to characterize and mitigate PCR NPM-bias (PCR...

In some mammals and many social insects, highly cooperative societies are characterized by reproductive division of labor, in which breeders and nonbreeders become behaviorally and morphologically distinct. While differences in behavior and growth between breeders and nonbreeders have been extensively described, little is known of their molecular u...

We present three classes of abstract prearithmetics, $\{\mathbf{A}_M\}_{M \geq 1}$, $\{\mathbf{A}_{-M,M}\}_{M \geq 1}$, and $\{\mathbf{B}_M\}_{M > 0}$. The first one is weakly projective with respect to the nonnegative real Diophantine arithmetic $\mathbf{R_+}=(\mathbb{R}_+,+,\times,\leq_{\mathbb{R}_+})$, the second one is weakly projective with re...

We develop a geometric framework that characterizes the synchronization problem --- the problem of consistently registering or aligning a collection of objects. The theory we formulate characterizes the cohomological nature of synchronization based on the classical theory of fibre bundles. We first establish the correspondence between synchronizati...

Genomic studies feature multivariate count data from high-throughput DNA sequencing experiments, which often contain many zero values. These zeros can cause artifacts for statistical analyses and multiple modeling approaches have been developed in response. Here, we apply different zero-handling models to gene-expression and microbiome datasets and...

Glioblastoma multiforme (GBM) is an aggressive form of human brain cancer that is under active study in the field of cancer biology. Its rapid progression and the relative time cost of obtaining molecular data make other readily-available forms of data, such as images, an important resource for actionable measures in patients. Our goal is to utiliz...

It has been a longstanding challenge in geometric morphometrics and medical imaging to infer the physical locations (or regions) of 3D shapes that are most associated with a given response variable (e.g.~class labels) without needing common predefined landmarks across the shapes, computing correspondence maps between the shapes, or requiring the sh...

Human tumors have distinct profiles of genomic alterations, and each of these alterations has the potential to cause unique changes to cellular homeostasis. Detailed analyses of these changes could reveal downstream effects of genomic alterations, contributing to our understanding of their roles in tumor development and progression. Across a range...

PCR amplification plays a central role in the measurement of mixed microbial communities via high-throughput sequencing. Yet PCR is also known to be a common source of bias in microbiome data. Here we present a paired modeling and experimental approach to characterize and mitigate PCR bias in microbiome studies. We use experimental data from mock b...

Changes in gene regulation have long been thought to play an important role in primate evolution. However, although a number of studies have compared genome-wide gene expression patterns across primate species, fewer have investigated the gene regulatory mechanisms that underlie such patterns, or the relative contribution of drift versus selection....

The problem of pattern and scale is a central challenge in ecology. In community ecology, an important scale is that at which we aggregate species to define our units of study, such as aggregation of “nitrogen fixing trees” to understand patterns in carbon sequestration. With the emergence of massive community ecological data sets, there is a need...

In this paper we consider a Bayesian framework for making inferences about dynamical systems from ergodic observations. The proposed Bayesian procedure is based on the Gibbs posterior, a decision theoretic generalization of standard Bayesian inference. We place a prior over a model class consisting of a parametrized family of Gibbs measures on a mi...

We develop a Gaussian-process mixture model for heterogeneous treatment effect estimation that leverages the use of transformed outcomes. The approach we will present attempts to improve point estimation and uncertainty quantification relative to past work that has used transformed variable related methods as well as traditional outcome modeling. E...

Following publication of the original article [1], the authors noticed an error in the presentation of equations in the PDF version.

Due to the advent and utility of high-throughput sequencing, modern biomedical research abounds with multivariate count data. Yet such sequence count data is often extremely sparse; that is, much of the data is zero values. Such zero values are well known to cause problems for statistical analyses. In this work we provide a systematic description o...

The problem of dimension reduction is of increasing importance in modern data analysis. In this paper, we consider modeling the collection of points in a high dimensional space as a union of low dimensional subspaces. In particular we propose a highly scalable sampling based algorithm that clusters the entire data via first spectral clustering of a...

Background:
Artificial gut models provide unique opportunities to study human-associated microbiota. Outstanding questions for these models' fundamental biology include the timescales on which microbiota vary and the factors that drive such change. Answering these questions though requires overcoming analytical obstacles like estimating the effect...

Longitudinal studies of microbial communities have emphasized that host-associated microbiota are highly dynamic as well as underscoring the potential biomedical relevance of understanding these dynamics. Despite this increasing appreciation, statistical challenges in the design and analysis of longitudinal microbiome studies such as sequence count...

Linear mixed models (LMMs) are used extensively to model observations that are not independent. Parameter estimation for LMMs can be computationally prohibitive on big data. State-of-the-art learning algorithms require computational complexity which depends at least linearly on the dimension $p$ of the covariates, and often use heuristics that do n...

Artificial gut models provide unique opportunities to study human-associated microbiota. Outstanding questions for these models’ fundamental biology include the timescales on which microbiota vary and the factors that drive such change. Answering these questions though requires overcoming analytical obstacles like estimating the effects of technica...

Linear mixed models (LMMs) are used extensively to model dependencies of observations in linear regression and appear in many application areas. Parameter estimation for LMMs can be computationally prohibitive on big data. State-of-the-art learning algorithms require computational complexity which depends at least linearly on the dimen...

We propose a representation of Gaussian processes (GPs) based on powers of the integral operator defined by a kernel function; we call these stochastic processes integral Gaussian processes (IGPs). Sample paths from IGPs are functions contained within the reproducing kernel Hilbert space (RKHS) defined by the kernel function; in contrast, sample pat...

The problem of pattern and scale is a central challenge in ecology. The problem of scale is central to community ecology, where functional ecological groups are aggregated and treated as a unit underlying an ecological pattern, such as aggregation of “nitrogen fixing trees” into a total abundance of a trait underlying ecosystem physiology. With the...

Diverse pathways drive resistance to BRAF/MEK inhibitors in BRAF-mutant melanoma, suggesting that durable control of resistance will be a challenge. By combining statistical modeling of genomic data from matched pre-treatment and post-relapse patient tumors with functional interrogation of >20 in vitro and in vivo resistance models, we discovered...

Automated geometric morphometric methods are promising tools for shape analysis in comparative biology, improving researchers' abilities to quantify variation extensively (by permitting more specimens to be analyzed) and intensively (by characterizing shapes with greater fidelity). Although use of these methods has increased, published automated me...

Nonlinear kernel regression models are often used in statistics and machine learning because they are more accurate than linear models. Variable selection for kernel regression models is a challenge partly because, unlike the linear regression setting, there is no clear concept of an effect size for regression coefficients. In this paper, we propos...

We present an efficient algorithm for learning mixed membership models when the number of variables $p$ is much larger than the number of hidden components $k$. This algorithm reduces the computational complexity of state-of-the-art tensor methods, which require decomposing an $O\left(p^3\right)$ tensor, to factorizing $O\left(p/k\right)$ sub-tenso...

Epistasis, commonly defined as the interaction between multiple genes, is an important genetic component underlying phenotypic variation. Many statistical methods have been developed to model and identify epistatic interactions between genetic variants. However, because of the large combinatorial search space of interactions, most epistasis mapping...

Power to detect pairwise epistatic heritability across all simulation scenarios.
Compared here is the power of the standard variance component model to estimate the true non-zero pairwise epistatic PVE at the significance level of α = 0.05 under a standard asymptotic normal test. Each simulation scenario is represented by a different color, with ea...

Empirical power of exhaustive search procedures to detect epistatic pairs.
Here, the effectiveness of MAPIT (green) as an initial step in a pairwise detection filtration process is compared against the more conventional single-SNP testing procedure, which is carried out via GEMMA (purple). In both cases, the search for epistatic pairs occurs betwee...

Enrichment of mepiQTL SNPs in GEUVADIS data set after using MAPIT with a genome-wide relatedness matrix.
Shown here is the distribution of locations for significant SNPs, relative to the 5′ most gene transcription start site (TSS) and the 3′ most gene transcription end site (TES). (A) displays the marginally epistatic QTL (mepiQTL) detected by MAP...

Chromosome-wide scans for epistatic effects in GEUVADIS data set.
Depicted are the $-\log_{10}(P)$ transformed MAPIT p-values of quality-control-positive cis-SNPs plotted against their genomic position in chromosomes (A) 2, (B) 3, and (C) 4, respectively. Note that MAPIT was implemented with $K_{cis}$. Here, the epistatic associated genes are labeled (blue)....

Chromosome-wide scans for epistatic effects in GEUVADIS data set.
Depicted are the $-\log_{10}(P)$ transformed MAPIT p-values of quality-control-positive cis-SNPs plotted against their genomic position in chromosomes (A) 9, (B) 10, and (C) 11, respectively. Note that MAPIT was implemented with $K_{cis}$. Here, the epistatic associated genes are labeled (blue)...

Chromosome-wide scans for epistatic effects in GEUVADIS data set.
Depicted are the $-\log_{10}(P)$ transformed MAPIT p-values of quality-control-positive cis-SNPs plotted against their genomic position in chromosomes (A) 15, (B) 16, and (C) 17, respectively. Note that MAPIT was implemented with $K_{cis}$. Here, the epistatic associated genes are labeled (blue...

Estimates of the proportion of phenotypic variance explained (PVE) by additive and pairwise epistatic effects for each gene analyzed in the GEUVADIS data set.
Estimates of the pPVE on the y-axis were calculated by using variance component models, where the components represent additive effects (grey) and pairwise epistasis (green). More...

Percentage of overlap (i.e. coverage) between the mepiGenes detected by MAPIT, the eGenes identified by GEMMA, and the epiGenes found by PLINK.
Coverage was computed as the proportion of significant genes detected by row j that were also identified by column k.

The marginal epistatic p-values for all significant mepiQTL as computed by MAPIT in the GEUVADIS data set using the cis-gene specific genetic relatedness matrix Kcis.
Strong significance of association for a particular SNP or locus was determined by using a gene specific Bonferroni-corrected significance p-value threshold $P = 0.05/\sum_i s_i$, where $s_i$ i...

The marginal epistatic p-values for all significant mepiQTL as computed by MAPIT in the GEUVADIS data set using the genome-wide specific genetic relatedness matrix KPop.
Strong significance of association for a particular SNP or locus was determined by using a gene specific Bonferroni-corrected significance p-value threshold $P = 0.05/\sum_i s_i$, where s...

Power analysis for detecting group 1 and group 2 causal SNPs in the presence of population stratification effects (Top 5 PCs).
We compare the mapping abilities of MAPIT (solid line) to the exhaustive search procedure in PLINK (dotted line) in scenarios I (A), II (B), III (C), and IV (D), under broad-sense heritability level $H^2 = 0.6$ and $\rho = 0.8$. He...

Power analysis for detecting group 1 and group 2 causal SNPs in the presence of population stratification effects (Top 10 PCs).
We compare the mapping abilities of MAPIT (solid line) to the exhaustive search procedure in PLINK (dotted line) in scenarios I (A), II (B), III (C), and IV (D), under broad-sense heritability level $H^2 = 0.6$ and $\rho = 0.5$. H...

Power analysis for detecting group 1 and group 2 causal SNPs in the presence of population stratification effects (Top 10 PCs).
We compare the mapping abilities of MAPIT (solid line) to the exhaustive search procedure in PLINK (dotted line) in scenarios I (A), II (B), III (C), and IV (D), under broad-sense heritability level $H^2 = 0.6$ and $\rho = 0.8$. H...

Chromosome-wide scans for epistatic effects in GEUVADIS data set.
Depicted are the $-\log_{10}(P)$ transformed MAPIT p-values of quality-control-positive cis-SNPs plotted against their genomic position in chromosomes (A) 5, (B) 7, and (C) 8, respectively. Note that MAPIT was implemented with $K_{cis}$. Here, the epistatic associated genes are labeled (blue)....

Empirical type I error estimates of MAPIT in the presence of population stratification effects (Top 5 PCs).
Each entry represents type I error rate estimates as the proportion of p-values a under the null hypothesis based on 100 simulated continuous phenotypes for the normal test (or z-test) and the Davies method. These results are based on 100 sim...

Empirical type I error estimates of MAPIT in the presence of population stratification effects (Top 10 PCs).
Each entry represents type I error rate estimates as the proportion of p-values a under the null hypothesis based on 100 simulated continuous phenotypes for the normal test (or z-test) and the Davies method. These results are based on 100 si...

The marginal epistatic p-values for all significant mepiQTL as computed by MAPIT in the GEUVADIS data set using the cis-gene specific genetic relatedness matrix Ktrans.
Strong significance of association for a particular SNP or locus was determined by using a gene specific Bonferroni-corrected significance p-value threshold $P = 0.05/\sum_i s_i$, where $s_i$...

Empirical power to detect simulated causal interacting markers.
(A) and (B) show the power of MAPIT to identify SNPs in each causal group under the Bonferroni-corrected genome-wide significance level $\alpha = 8.3 \times 10^{-6}$. Groups 1 and 2 causal markers are colored in light red and light blue, respectively. These figures are based on a broad-sense heritabil...

Accuracy of total pairwise epistatic PVE estimates across all simulation scenarios.
Compared here are the epistatic PVE estimates computed by the standard variance component model. Each simulation scenario is represented by a different color, with each of the three simulation schemes being labeled on the x-axis. These figures are based on 100 simul...

Comparison of epistatic filtration methods with MAPIT and GEMMA on the GEUVADIS data set.
(A)-(C) show histograms of the MAPIT p-values for all variants in the GEUVADIS data set using the genome-wide genetic relatedness matrix Ktrans (A), KGW (B), and KPop (C), respectively. The horizontal red line corresponds to a uniform distribution of p-value...