Himel Mallick
Cornell University | CU · Department of Population Health Sciences
Biostatistics PhD - Fellow of the American Statistical Association
Check out our new differential analysis tool for multi-omics: github.com/himelmallick/Tweedieverse
About
86 Publications
33,668 Reads
6,907 Citations
Introduction
I am a tenure-track Principal Investigator researching at Cornell University's Department of Population Health Sciences.
I develop computational methods to generate and validate testable hypotheses that accelerate data-driven discovery.
Prior to Cornell, I was an Associate Principal Scientist at Merck Research Laboratories and a postdoctoral fellow of Computational Biology and Bioinformatics at Harvard University.
Additional affiliations
March 2019 - April 2023
June 2018 - July 2018
Education
October 2015 - March 2019
October 2015 - March 2019
August 2009 - August 2015
Publications (86)
Objective:
To develop and validate a mortality risk algorithm for obese black and white men and women to elucidate risk factors prognostic of short-term mortality among obese persons.
Methods:
Prospective cohort study. The Reasons for Geographic and Racial Differences in Stroke (REGARDS) study is a cohort of black and white men and women aged ≥45 y...
In this article, we present a selective overview of some recent developments in Bayesian model and variable selection methods for high dimensional linear models. While most of the reviews in the literature are based on conventional methods, we focus on recently developed methods, which have proven to be successful in dealing with high dimensional varia...
Complex human diseases usually have multifactorial causes, and may develop as a result of the collective effects of multiple genetic variants, complex gene-gene/gene-environment interactions, rare sequence variants, copy number alterations, epigenetic modifications, etc. Understanding the genetic aetiology of complex human diseases requires a compre...
A Bayesian bi-level variable selection method (BAGB: Bayesian Analysis of Group Bridge) is developed for regularized regression and classification. This new development is motivated by grouped data, where generic variables can be divided into multiple groups, with variables in the same group being mechanistically related or statistically correlated...
A reciprocal LASSO (rLASSO) regularization employs a decreasing penalty function as opposed to conventional penalization approaches that use increasing penalties on the coefficients, leading to stronger parsimony and superior model selection relative to traditional shrinkage methods. Here we consider a fully Bayesian formulation of the rLASSO probl...
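The contrast between increasing and decreasing penalties can be sketched as follows. This uses the penalty form commonly given for the reciprocal LASSO in the literature, which is an assumption here because the abstract above does not state it. The notation (response y, design matrix X, coefficients β, tuning parameter λ, p predictors) is likewise introduced only for illustration.

```latex
% LASSO: the penalty grows with the size of each coefficient
\hat{\beta}_{\mathrm{LASSO}}
  = \arg\min_{\beta} \; \lVert y - X\beta \rVert_2^2
    + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert

% Reciprocal LASSO (assumed form): the penalty shrinks as a nonzero
% coefficient grows, and is zero for coefficients that are exactly zero
\hat{\beta}_{\mathrm{rLASSO}}
  = \arg\min_{\beta} \; \lVert y - X\beta \rVert_2^2
    + \lambda \sum_{j=1}^{p} \frac{1}{\lvert \beta_j \rvert}\,
      \mathbb{1}\{\beta_j \neq 0\}
```

Under this form, small nonzero coefficients are penalized most heavily, which is one way to read the "stronger parsimony" described above.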
It is increasingly common in a wide variety of applied settings to collect data of multiple different types on the same set of samples. Our particular focus in this article is on studying relationships between such multiview features and responses. A motivating application arises in the context of precision medicine where multi-omics data are colle...
Despite the availability of several high-profile, state-of-the-art methods, analyzing bulk RNA-Seq data continues to face significant challenges. Evidence from recent studies has highlighted that popular differential expression (DE) tools, such as edgeR and DESeq2, are susceptible to an alarmingly high false discovery rate (FDR). These studies sugg...
With the growing commonality of multi‐omics datasets, there is now increasing evidence that integrated omics profiles lead to more efficient discovery of clinically actionable biomarkers that enable better disease outcome prediction and patient stratification. Several methods exist to perform host phenotype prediction from cross‐sectional, single‐o...
The primary goal of this article is to extend the reciprocal LASSO for applications to binary and survival outcomes. We consider the least squares approximation (LSA) as a solver for the reciprocal LASSO problem. The LSA is a general theoretical framework that includes generalized linear models, Cox regression, and many others as special cases. By...
With the growing commonality of multi-omics datasets, there is now increasing evidence that integrated omics profiles lead to the more efficient discovery of clinically actionable biomarkers that enable better disease outcome prediction and patient stratification. Several methods exist to perform host phenotype prediction from cross-sectional, sing...
Microbiome studies of inflammatory bowel diseases (IBD) have achieved a scale for meta-analysis of dysbioses among populations. To enable microbial community meta-analyses generally, we develop MMUPHin for normalization, statistical meta-analysis, and population structure discovery using microbial taxonomic and functional profiles. Applying it to t...
The performance of computational methods and software to identify differentially expressed features in single‐cell RNA‐sequencing (scRNA‐seq) has been shown to be influenced by several factors, including the choice of the normalization method used and the choice of the experimental platform (or library preparation protocol) to profile gene expressi...
It is challenging to associate features such as human health outcomes, diet, environmental conditions, or other metadata to microbial community measurements, due in part to their quantitative properties. Microbiome multi-omics are typically noisy, sparse (zero-inflated), high-dimensional, extremely non-normal, and often in the form of count or comp...
Many methods have been developed for statistical analysis of microbial community profiles, but due to the complex nature of typical microbiome measurements (e.g. sparsity, zero-inflation, non-independence, and compositionality) and of the associated underlying biology, it is difficult to compare or evaluate such methods within a single systematic f...
Motivation
The discovery of biologically interpretable and clinically actionable communities in heterogeneous omics data is a necessary first step towards deriving mechanistic insights into complex biological phenomena. Here we present a novel clustering approach, omeClust, for community detection in omics profiles by simultaneously incorporating s...
Summary
The performance of computational methods and software to identify differentially expressed genes in single-cell RNA-sequencing (scRNA-seq) has been shown to be influenced by several factors, including the choice of the normalization method used and the choice of the experimental platform (or library preparation protocol) to profile gene ex...
The reciprocal LASSO estimate for linear regression corresponds to a posterior mode when independent inverse Laplace priors are assigned on the regression coefficients. This paper studies reciprocal LASSO in quantile regression from a Bayesian perspective. Simple and efficient Gibbs sampling algorithms are developed for posterior inference using a...
Microbial community studies in general, and of the human microbiome in inflammatory bowel disease (IBD) in particular, have now achieved a scale at which it is practical to associate features of the microbiome with environmental exposures and health outcomes across multiple large-scale populations. This permits the development of rigorous meta-anal...
There has never been a more exciting time to do microbiome research thanks to the recent completion of several population-scale, longitudinal multi-omics studies including the NIH integrative human microbiome project (iHMP; iHMP Consortium, 2019) that have facilitated a multitude of new avenues of research for future investigations. These breakthro...
Background & aims:
Sulfur-metabolizing microbes, which convert dietary sources of sulfur into genotoxic hydrogen sulfide (H2S), have been associated with development of colorectal cancer (CRC). We identified a dietary pattern associated with sulfur-metabolizing bacteria in stool and then investigated its association with risk of incident CRC using...
A mosaic of cross-phylum chemical interactions occurs between all metazoans and their microbiomes. A number of molecular families that are known to be produced by the microbiome have a marked effect on the balance between health and disease [1–9]. Considering the diversity of the human microbiome (which numbers over 40,000 operational taxonomic units1...
A reciprocal LASSO (rLASSO) regularization employs a decreasing penalty function as opposed to conventional penalization methods that use increasing penalties on the coefficients, leading to stronger parsimony and superior model selection relative to traditional shrinkage methods. Here we consider a fully Bayesian formulation of the rLASSO problem,...
Microbial community metabolomics, particularly in the human gut, are beginning to provide a new route to identify functions and ecology disrupted in disease. However, these data can be costly and difficult to obtain at scale, while amplicon or shotgun metagenomic sequencing data are readily available for populations of many thousands. Here, we desc...
Zero-inflated count models have received a considerable amount of attention in recent years, fuelled by their widespread applications in many scientific disciplines. In this paper, we consider the problem of selecting grouped variables in zero-inflated Poisson (ZIP) models via group bridge regularization. The ZIP mixture likelihood with a group-wise...
In the Supplementary Tables 2, 4 and 6 originally published with this Article, the authors mistakenly included sample identifiers in the form of UMCGs rather than UMCG IBDs in the validation cohort; this has now been amended.
Inflammatory bowel diseases, which include Crohn’s disease and ulcerative colitis, affect several million individuals worldwide. Crohn’s disease and ulcerative colitis are complex diseases that are heterogeneous at the clinical, immunological, molecular, genetic, and microbial levels. Individual contributing factors have been the focus of extensive...
The inflammatory bowel diseases (IBDs), which include Crohn’s disease (CD) and ulcerative colitis (UC), are multifactorial chronic conditions of the gastrointestinal tract. While IBD has been associated with dramatic changes in the gut microbiota, changes in the gut metabolome—the molecular interface between host and microbiota—are less well unders...
We consider the problem of modelling count data with excess zeros using Zero-inflated Poisson (ZIP) regression. Recently, various regularization methods have been developed for variable selection in ZIP models. Among these, EM LASSO is a popular method for simultaneous variable selection and parameter estimation. However, EM LASSO suffers from esti...
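For readers unfamiliar with the ZIP model, the following minimal sketch shows the standard two-component mixture it rests on: a point mass at zero plus an ordinary Poisson. The function name `zip_pmf` and the parameter names `lam` (Poisson mean) and `pi` (structural-zero probability) are illustrative choices, not taken from the paper, and no regression or penalization is included.

```python
import math


def zip_pmf(k, lam, pi):
    """P(Y = k) for a zero-inflated Poisson.

    lam: mean of the Poisson component (hypothetical name).
    pi:  probability of a structural (excess) zero (hypothetical name).
    """
    poisson = math.exp(-lam) * lam**k / math.factorial(k)
    if k == 0:
        # Zeros arise either structurally or from the Poisson component.
        return pi + (1.0 - pi) * poisson
    return (1.0 - pi) * poisson


lam, pi = 2.0, 0.3
p_zero_poisson = math.exp(-lam)       # zero probability without inflation
p_zero_zip = zip_pmf(0, lam, pi)      # zero probability with inflation
total = sum(zip_pmf(k, lam, pi) for k in range(60))

print(f"P(Y=0) Poisson: {p_zero_poisson:.4f}")
print(f"P(Y=0) ZIP:     {p_zero_zip:.4f}")
print(f"sum of pmf over k < 60: {total:.6f}")
```

In a ZIP regression, `lam` and `pi` would each depend on covariates, and variable selection would act on those two sets of coefficients.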
In many biomedical applications, covariates are naturally grouped, with variables in the same group being systematically related or statistically correlated. Under such settings, variable selection must be conducted at both group and individual variable levels. Motivated by the widespread availability of zero-inflated count outcomes and grouped cov...
Classical bridge regression is known to possess many desirable statistical properties such as oracle, sparsity, and unbiasedness. One outstanding disadvantage of bridge regularization, however, is that it lacks a systematic approach to inference, reducing its flexibility in practical applications. In this study, we propose bridge regression from a...
Objective:
Family history is the strongest risk factor for developing Crohn's disease (CD) or ulcerative colitis (UC). We investigated whether the proximity of relationship with the affected relative and concordance for type of IBD modifies the effect of family history on phenotype and disease severity.
Design:
This cross-sectional study include...
Characterizing the stability of the gut microbiome is important to exploit it as a therapeutic target and diagnostic biomarker. We metagenomically and metatranscriptomically sequenced the faecal microbiomes of 308 participants in the Health Professionals Follow-Up Study. Participants provided four stool samples—one pair collected 24–72 h apart and...
The gut microbiome is intimately related to human health, but it is not yet known which functional activities are driven by specific microorganisms' ecological configurations or transcription. We report a large-scale investigation of 372 human faecal metatranscriptomes and 929 metagenomes from a subset of 308 men in the Health Professionals Follow-...
Studies of the microbiome have become increasingly sophisticated, and multiple sequence-based, molecular methods as well as culture-based methods exist for population-scale microbiome profiles. To link the resulting host and microbial data types to human health, several experimental design considerations, data analysis challenges, and statistical e...
Compositional data consist of vectors of proportions normalized to a constant sum from a basis of unobserved counts. The sum constraint makes inference on correlations between unconstrained features challenging due to the information loss from normalization. However, such correlations are of long-standing interest in fields including ecology. We pr...
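The effect of the sum constraint can be illustrated with a small simulation. This is only a sketch under assumed settings (three independent log-normal basis features, 5,000 samples, a fixed seed); it is not the method proposed in the article. It shows that normalizing to proportions induces negative correlations even when the underlying counts are uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 5000, 3

# Unobserved basis: independent log-normal "counts" (assumed distribution).
basis = rng.lognormal(mean=0.0, sigma=0.5, size=(n_samples, n_features))

# Closure: normalize each sample so its features sum to one.
composition = basis / basis.sum(axis=1, keepdims=True)

basis_corr = np.corrcoef(basis, rowvar=False)
comp_corr = np.corrcoef(composition, rowvar=False)
off_diag = ~np.eye(n_features, dtype=bool)

print("basis correlations:      ", np.round(basis_corr[off_diag], 2))
print("composition correlations:", np.round(comp_corr[off_diag], 2))
```

With three exchangeable features the induced correlation between proportions is close to -0.5, purely as a consequence of the normalization.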
Detailed mathematical derivations.
Beginning from initial definitions, a step-by-step derivation of the likelihood in Eq (1), the marginal likelihood for the composition, and the Taylor Series approximation in Eq (2).
(DOCX)
Implementation of methods compared.
Details on how each of the methods compared in the Results section were implemented, run on the simulated data, and evaluated for type I and type II errors.
(DOCX)
Prior distributions for test cases.
The prior distributions for the test cases used a prior on m that was very uninformative, being centered at 0 and with a large variance. The prior on λ put most prior weight on λ values less than one and had narrow tails to encourage shrinkage of the correlation estimates (B).
(TIF)
Additional results for difficult scenarios.
The estimates and significance of several methods on the four scenarios (columns): simple, with no true correlations and no negative dominant spurious correlation; high spurious, with no true correlations and a negative dominant spurious correlation; retained spike, with several true correlations and no n...
Type I error rates and power in large datasets.
Results on simulated data comprising 100 SparseDOSSA-derived compositions modeled on a high-diversity dataset with 89 features. (A) Type I error rates are controlled across all correlation values only by BAnOCC. (B) Power is comparable between BAnOCC and other modern methods across spiked correlation stre...
Prior distributions for the posterior fornix body site.
We used a prior for m that gave reasonable behavior for the sum of the unobserved count medians $\sum_{j=1}^{11} e^{m_j}$ (A). The prior on λ put most prior weight on λ values less than one and had narrow tails to encourage shrinkage of the correlation estimates (B). (See also S14 Fig).
(TIF)
Simulated data for difficult scenarios.
The simulated data for each of four difficult simulation scenarios described in the Results section. For details on how these were generated, see S2 Text.
(ZIP)
A relatively informative prior on λ is effective.
The densities of different priors on λ for different ranges of λ values. The shape parameter a determines how quickly the prior density decreases, while the rate parameter b determines how much prior weight is placed on small λ values rather than large λ values.
(TIF)
Shrinkage increases for smaller λ.
(A) The shape of the prior on $o_{jk}$ and $o_{jj}$ for several values of λ. Smaller λ results in greater shrinkage towards zero. (B-C) The prior probability in the interval (−0.001, 0.001) for each off-diagonal element $o_{jk} \mid \lambda \sim \mathrm{Laplace}(\lambda)$ across small (B) or large (C) values of λ. Small values (<0.1) of λ show the greatest shrink...
Prior distributions for realistic simulated data.
We used a prior for m that gave reasonable behavior for the sum of the unobserved count medians $\sum_{j=1}^{14} e^{m_j}$ (A). The prior on λ put most prior weight on λ values less than one and had narrow tails to encourage shrinkage of the correlation estimates (B). (See also S14 Fig).
(TIF)
Average ROC curves of method performance on “realistic” simulated datasets.
For a given correlation strength, each ROC is calculated over the aggregation of all 735 true associations in 105 simulated datasets comprising SparseDOSSA-derived compositions with 100 samples modeled on a low-diversity dataset with 14 features. The cutoffs used are based...
Power across multiple sample sizes and numbers of features.
Power on simulated data comprising SparseDOSSA-derived compositions modeled on a low-diversity dataset with 14 features (small template) or a high-diversity dataset with 89 features (large template). See S2 Text for simulation details. The rows correspond to the number of samples (50, 100,...
Prior distributions for the stool body site.
We used a prior for m that gave reasonable behavior for the sum of the unobserved count medians $\sum_{j=1}^{24} e^{m_j}$ (A). The prior on λ put most prior weight on λ values less than one and had narrow tails to encourage shrinkage of the correlation estimates (B). (See also S14 Fig).
(TIF)
Prior distributions for the buccal mucosa body site.
We used a prior for m that gave reasonable behavior for the sum of the unobserved count medians $\sum_{j=1}^{21} e^{m_j}$ (A). The prior on λ put most prior weight on λ values less than one and had narrow tails to encourage shrinkage of the correlation estimates (B). (See also S14 Fig).
(TIF)
Implied priors on median unobserved counts.
The implied priors on the median unobserved counts $e^{m_j}$ (top panel) and the sum of the median unobserved counts $\sum_{j=1}^{p} e^{m_j}$ (bottom panel) for the SparseDOSSA simulated data and the body sites from the application. Each distribution is estimated using 100,000 random samples. The mean of $m_j$ was selected such t...
Comparison of inferred networks on HMP data.
The number of edges significant in both methods, neither method, or only one method, stratified by body site and whether the methods use the log-basis precision or correlation matrix. Most edges are concordantly significant (or not) between both methods; few are significant by only one method. Further, m...
Prior distributions for large datasets.
For our larger datasets simulated based on a stool dataset with 89 features, we used a prior for m that gave reasonable behavior for the sum of the basis medians $\sum_{j=1}^{89} e^{m_j}$ (A). The prior on λ put most prior weight on λ values less than one and had narrow tails to encourage shrinkage of the correlation estimat...
Detailed description of simulated datasets.
Descriptions of how the datasets were generated for both the challenging scenarios case and for the realistic data case.
(DOCX)
Realistic simulated data.
All simulated datasets from SparseDOSSA, as well as the template dataset used. For details on how these were generated, see S2 Text.
(ZIP)
HMP taxonomic profiles.
The taxonomic profiles from the Human Microbiome Project data as processed with MetaPhlAn version 2.0_beta1 [24].
(ZIP)
AUC boxplots of method performance on “realistic” simulated datasets.
For each given correlation strength and template dataset, AUCs were calculated for each of 105 simulated datasets comprising SparseDOSSA-derived compositions with 100 samples modeled on a low-diversity dataset with 14 features. The ROCs used to measure the AUCs are based on p-val...
Type I error rates across multiple sample sizes and numbers of features.
Type I error rates on simulated data comprising SparseDOSSA-derived compositions modeled on a low-diversity dataset with 14 features (small template) or a high-diversity dataset with 89 features (large template). See S2 Text for simulation details. The rows correspond to the n...
BAnOCC stool network.
The significant edges from running BAnOCC on the stool body site with 5,500 warmup iterations and 12,000 total iterations. Edges are ordered by posterior median correlation magnitude. “hpd.95.ci” indicates the highest posterior density 95% credible intervals.
(XLSX)
BAnOCC posterior fornix network.
The significant edges from running BAnOCC on the posterior fornix body site with 1,500 warmup iterations and 5,000 total iterations. Edges are ordered by posterior median correlation magnitude. “hpd.95.ci” indicates the highest posterior density 95% credible intervals.
(XLSX)
BAnOCC buccal mucosa network.
The significant edges from running BAnOCC on the buccal mucosa body site with 5,500 warmup iterations and 12,000 total iterations. Edges are ordered by posterior median correlation magnitude. “hpd.95.ci” indicates the highest posterior density 95% credible intervals.
(XLSX)
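The "hpd.95.ci" columns in the three network tables above report highest posterior density intervals. As a rough illustration of what such an interval is, the sketch below computes one from posterior draws as the shortest window containing 95% of the sorted samples. This is a generic sample-based estimator and not necessarily how BAnOCC computes its intervals; the function name `hpd_interval` and the exponential test draws are assumptions made for the example.

```python
import numpy as np


def hpd_interval(samples, mass=0.95):
    """Shortest interval containing at least `mass` of the draws.

    Assumes a unimodal posterior; hypothetical helper, not BAnOCC's API.
    """
    x = np.sort(np.asarray(samples, dtype=float))
    n = x.size
    k = int(np.ceil(mass * n))  # number of draws the interval must cover
    if k < 1 or k > n:
        raise ValueError("mass must be in (0, 1] and samples non-empty")
    # Width of every window of k consecutive sorted draws.
    widths = x[k - 1:] - x[: n - k + 1]
    i = int(np.argmin(widths))
    return x[i], x[i + k - 1]


rng = np.random.default_rng(1)
draws = rng.exponential(scale=1.0, size=20000)  # skewed stand-in posterior

lo, hi = hpd_interval(draws, mass=0.95)
eq_lo, eq_hi = np.quantile(draws, [0.025, 0.975])

print(f"HPD interval:          ({lo:.3f}, {hi:.3f})")
print(f"equal-tailed interval: ({eq_lo:.3f}, {eq_hi:.3f})")
```

For a skewed distribution such as this one, the HPD interval is shorter than the equal-tailed interval while covering the same share of the draws.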
Background
Recent advances in next-generation sequencing (NGS) technology enable researchers to collect a large volume of metagenomic sequencing data. These data provide valuable resources for investigating interactions between the microbiome and host environmental/clinical factors. In addition to the well-known properties of microbiome count measu...
Motivation: The human microbiome plays an important role in human health and disease. The composition of the human microbiome is influenced by multiple factors and understanding these factors is critical to elucidate the role of the microbiome in health and disease and for development of new diagnostics or therapeutic targets based on the microbiom...
Count data are increasingly ubiquitous in genetic association studies, where it is possible to observe excess zero counts as compared to what is expected based on standard assumptions. For instance, in rheumatology, data are usually collected in multiple joints within a person or multiple sub-regions of a joint, and it is not uncommon that the phen...
Variable selection refers to the class of problems where one tries to find an optimal subset of relevant variables, which can be used to accurately predict the outcome of a certain response variable. Typically, a large number of variables are collected; however, only a few important variables are relevant for the prediction of the outcome,...