J. Sunil Rao's research while affiliated with University of Miami Miller School of Medicine and other places

Publications (94)

Article
Principal components analysis has been used to reduce the dimensionality of datasets for a long time. In this paper, we will demonstrate that in mode detection the components of smallest variance, the pettiest components, are more important. We prove that for a multivariate normal or Laplace distribution, we obtain boxes of optimal volume by implem...
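The pettiest-components idea can be illustrated with a minimal sketch (not the paper's algorithm; the two-cluster layout and all numbers below are invented for illustration). Two tight clusters are separated along a low-variance axis, so the smallest-variance principal component, not the largest, reveals the modes:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two tight clusters separated along the y-axis; within-cluster spread is
# large in x and tiny in y, so the mode-separating direction carries
# little total variance and lands among the *smallest* components.
X = np.vstack([
    rng.normal([0, 0], [5.0, 0.2], size=(500, 2)),
    rng.normal([0, 3], [5.0, 0.2], size=(500, 2)),
])

cov = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns eigenvalues ascending
pettiest = eigvecs[:, 0]                 # direction of smallest variance
largest = eigvecs[:, -1]                 # direction of largest variance

proj_petty = X @ pettiest                # bimodal: separates the clusters
proj_large = X @ largest                 # unimodal: clusters overlap
```

Projecting onto the pettiest component separates the two cluster means by about 3 units, while the largest component mixes them, which is the sense in which the smallest-variance components matter for mode detection.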
Preprint
Full-text available
The topic of this paper is prevalence estimation from the perspective of active information. Prevalence among tested individuals has an upward bias under the assumption that individuals' willingness to be tested for the disease increases with the strength of their symptoms. Active information due to testing bias quantifies the degree to which the w...
Preprint
Full-text available
We argue that information from countries that had earlier COVID-19 surges can be used to inform another country's current model, thereby generating what we call back-to-the-future (BTF) projections. We show that these projections can be used to accurately predict future COVID-19 surges prior to an inflection point of the daily infection curve. We show,...
Article
Full-text available
Purpose: To compare the ability of linear mixed models with different random effect distributions to estimate rates of visual field loss in glaucoma patients. Methods: Eyes with five or more reliable standard automated perimetry (SAP) tests were identified from the Duke Glaucoma Registry. Mean deviation (MD) values from each visual field and ass...
Preprint
Philosophers frequently define knowledge as justified, true belief. In this paper we build a mathematical framework that makes it possible to define learning (increased degree of true belief) and knowledge of an agent in precise ways. This is achieved by phrasing belief in terms of epistemic probabilities, defined from Bayes' Rule. The degree of true...
Preprint
Full-text available
Surveillance studies for Covid-19 prevalence estimation are subject to sampling bias due to oversampling of symptomatic individuals and error-prone tests, particularly rapid antigen tests, which are known to have high false negative rates for asymptomatic individuals. This results in naive estimators which can be very far from the truth. In this work...
Preprint
Full-text available
Purpose: To compare the ability of linear mixed models with different random effect distributions to estimate rates of visual field loss in glaucoma patients. Design: Retrospective cohort study. Methods: Eyes with ≥5 reliable standard automated perimetry (SAP) tests were identified from the Duke Glaucoma Registry. Mean deviation (MD) values from ea...
Article
Full-text available
Background Collecting social determinants of health in electronic health records is time-consuming. Meanwhile, an Area Deprivation Index (ADI) aggregates sociodemographic information from census data. The objective of this study was to ascertain whether ADI is associated with stage of human papillomavirus (HPV)-related cancer at diagnosis. Methods...
Preprint
Full-text available
Principal component analysis has been used to reduce the dimensionality of datasets for a long time. In this paper, we will demonstrate that in mode detection the components of smallest variance, the pettiest components, are more important. We prove that when the data follows a multivariate normal distribution, by implementing "pettiest component analy...
Article
Full-text available
Alcohol use disorder (AUD) is a widespread disease leading to the deterioration of cognitive and other functions. Mechanisms by which alcohol affects the brain are not fully elucidated. Splicing constitutes a nuclear process of RNA maturation, which results in the formation of the transcriptome. We tested the hypothesis as to whether AUD impairs sp...
Article
COVID-19 testing has become a standard approach for estimating prevalence, which then assists in public health decision making to contain and mitigate the spread of the disease. The sampling designs used are often biased in that they do not reflect the true underlying populations. For instance, individuals with strong symptoms are more likely to be t...
Preprint
Full-text available
We develop hypothesis testing for active information — the averaged quantity in the Kullback–Leibler divergence. To our knowledge, this is the first paper to derive exact probabilities of type-I errors for hypothesis testing in the area.
Preprint
Full-text available
We propose a new method to find modes based on active information. We develop an algorithm that, when applied to the whole space, will say whether there are any modes present and where they are; this algorithm will reduce the dimensionality without resorting to Principal Components; and more importantly, population-wise, will not detect mo...
Article
Public genomic repositories are notoriously lacking in racially and ethnically diverse samples. This limits the reaches of exploration and has in fact been one of the driving factors for the initiation of the All of Us project. Our particular focus here is to provide a model-based framework for accurately predicting DNA methylation from genetic dat...
Article
Finite mixtures of regressions have been used to analyze data that come from a heterogeneous population. When more than one response is observed, accommodating a multivariate response can be useful. In this article, we go a step further and introduce a multivariate extension that includes a latent overlapping cluster indicator variable that allows...
Article
COVID-19 testing studies have become a standard approach for estimating prevalence and fatality rates which then assist in public health decision making to contain and mitigate the spread of the disease. The sampling designs used are often biased in that they do not reflect the true underlying populations. For instance, individuals with strong symp...
Preprint
COVID-19 testing studies have become a standard approach for estimating prevalence and fatality rates which then assist in public health decision making to contain and mitigate the spread of the disease. The sampling designs used are often biased in that they do not reflect the true underlying populations. For instance, individuals with strong symp...
Conference Paper
Full-text available
The genomics revolution also spawned the dawn of precision medicine. As in the National Research Council definition, if its promise is fully realized, then more accurate decisions about individual patient treatments and outcomes will be possible. Disparities researchers have also begun looking to the precision medicine paradigm with the ho...
Article
Background: Contextual determinants of health, including social, environmental, healthcare and others, are a so-called deck of cards one is dealt. The ability to modify health outcomes varies then based upon how one’s hand is played. It is thus of great interest to understand how these determinants associate with the emerging pandemic coronavirus d...
Preprint
Paraphrasing [Morano and Holt, 2017], contextual determinants of health including social, environmental, healthcare and others, are a so-called deck of cards one is dealt. The ability to modify health outcomes varies then based upon how one's hand is played. It is thus of great interest to understand how these determinants associate with the emergi...
Article
A small area typically refers to a subpopulation or domain of interest for which a reliable direct estimate, based only on the domain-specific sample, cannot be produced due to small sample size in the domain. While traditional small area methods and models are widely used nowadays, there has also been much work and interest in robust statistical...
Article
We develop hypothesis testing for active information — the averaged quantity in the Kullback–Leibler divergence. To our knowledge, this is the first paper to derive exact probabilities of type-I errors for hypothesis testing in the area.
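As a hedged illustration of the quantity being tested: the pointwise active information log(p/p0), averaged under p, is exactly the Kullback–Leibler divergence KL(p ∥ p0). The two discrete distributions below are invented for illustration, not taken from the paper:

```python
import numpy as np

# A uniform baseline p0 and a biased observed distribution p over the
# same four states (values are illustrative only).
p0 = np.array([0.25, 0.25, 0.25, 0.25])
p  = np.array([0.40, 0.30, 0.20, 0.10])

# Pointwise active information, and its p-average = KL(p || p0).
pointwise_active_info = np.log(p / p0)
kl = np.sum(p * pointwise_active_info)
```

The sum is nonnegative (zero only when p equals p0), which is what makes it usable as a test statistic for departure from the baseline.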
Article
Full-text available
We propose a new method to find modes based on active information. We develop an algorithm called active information mode hunting (AIMH) that, when applied to the whole space, will say whether there are any modes present and where they are. We show AIMH is consistent and, given that information increases where probability decreases, it helps to ove...
Chapter
Precise outcome prediction at an individual level from diverse genomic data is a problem of great interest as the focus on precision medicine grows. This typically requires estimation of subgroup-specific models which may differ in their mean and/or variance structure. Thus in order to accurately predict outcomes for new individuals, it’s necessar...
Article
Full-text available
The Genomics of Drug Sensitivity in Cancer (GDSC) and Cancer Cell Line Encyclopedia (CCLE) are two major studies that can be used to mine for therapeutic biomarkers for cancers of a large variety. Model validation using the two datasets, however, has proved challenging. Neither predictions nor signatures consistently validate well for models buil...
Chapter
Full-text available
Principal Components Analysis is a widely used technique for dimension reduction and characterization of variability in multivariate populations. Our interest lies in studying when and why the rotation to principal components can be used effectively within a response-predictor set relationship in the context of mode hunting. Specifically focusing o...
Article
Full-text available
Cancer cell lines have frequently been used to link drug sensitivity and resistance with genomic profiles. To capture genomic complexity in cancer, the Cancer Genome Project (CGP) (Garnett et al., 2012) screened 639 human tumor cell lines with 130 drugs ranging from known chemotherapeutic agents to experimental compounds. Questions of interest incl...
Article
Full-text available
Shrinkage estimators that possess the ability to produce sparse solutions have become increasingly important to the analysis of today's complex datasets. Examples include the LASSO, the Elastic-Net and their adaptive counterparts. Estimation of penalty parameters, however, still presents difficulties. While variable selection consistent procedures ha...
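A minimal sketch of how such shrinkage estimators produce exactly sparse solutions, assuming an orthonormal design so the LASSO reduces to soft-thresholding of the OLS coefficients (the coefficient values and penalty below are illustrative, not from the paper):

```python
import numpy as np

def soft_threshold(z, lam):
    """LASSO shrinkage operator: sign(z) * max(|z| - lam, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

# With an orthonormal design matrix, the LASSO solution is the OLS
# coefficient vector passed through soft-thresholding; lam is the
# penalty parameter whose estimation the abstract refers to.
ols = np.array([3.0, -0.4, 0.05, 1.2])
lam = 0.5
lasso = soft_threshold(ols, lam)   # small coefficients snap exactly to zero
```

Here the two coefficients smaller in magnitude than the penalty are set exactly to zero, which is the sparsity property the abstract highlights; choosing lam well is the hard part.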
Article
Many practical problems are related to prediction, where the main interest is at subject (e.g., personalized medicine) or (small) sub-population (e.g., small community) level. In such cases, it is possible to make substantial gains in prediction accuracy by identifying a class that a new subject belongs to. This way, the new subject is potentially...
Article
Among the problems posed by high-dimensional datasets (the so-called p ≫ n paradigm) are that variable-specific estimators of variances are not reliable and test statistics have low power, both due to a lack of degrees of freedom. In addition, variance is observed to be a function of the mean. We introduce a non-parametric adaptive regularization pro...
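The flavor of this can be shown with a crude stand-in for the paper's procedure (the binned trend and the fixed 0.5 shrinkage weight below are illustrative simplifications, not the adaptive method itself): with few samples per variable, raw variances are noisy, and pulling them toward a mean-dependent trend reduces overall error.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical p >> n setting: 2000 variables, 5 samples each, with the
# true variance an increasing function of the mean, as observed in practice.
p, n = 2000, 5
means_true = rng.uniform(1.0, 10.0, size=p)
sd_true = 0.5 * np.sqrt(means_true)              # variance grows with mean
X = rng.normal(means_true[:, None], sd_true[:, None], size=(p, n))

m = X.mean(axis=1)
v = X.var(axis=1, ddof=1)                        # noisy with only n-1 = 4 df

# Estimate a mean-variance trend by binning on the mean, then shrink each
# raw variance halfway toward the trend.
bins = np.quantile(m, np.linspace(0, 1, 21))
idx = np.clip(np.digitize(m, bins) - 1, 0, 19)
trend = np.array([v[idx == k].mean() for k in range(20)])
v_shrunk = 0.5 * v + 0.5 * trend[idx]
```

Even this naive version beats the raw variance estimates in mean squared error against the true variances, which is the borrowing-of-strength idea the regularization formalizes.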
Article
We present an implementation in the R language for statistical computing of our recent non-parametric joint adaptive mean-variance regularization and variance stabilization procedure. The method is specifically suited for handling difficult problems posed by high-dimensional multivariate datasets (p ≫ n paradigm), such as in 'omics'-type data, amon...
Article
PRIMsrc is a novel implementation of a non-parametric bump hunting procedure, based on the Patient Rule Induction Method (PRIM), offering a unified treatment of outcome variables, including censored time-to-event (Survival), continuous (Regression) and discrete (Classification) responses. To fit the model, it uses a recursive peeling procedure with...
Article
Full-text available
We introduce a survival/risk bump hunting framework to build a bump hunting model with a censored time-to-event response. Our method, called Survival Bump Hunting, relies on a rule-induction method based on recursive peelings that use specific survival peeling criteria such as hazard ratios or log-rank test statistics. To validate our model and imp...
Article
Principal Components Analysis is a widely used technique for dimension reduction and characterization of variability in multivariate populations. Our interest lies in studying when and why the rotation to principal components can be used effectively within a response-predictor set relationship in the context of mode hunting. Specifically focusing o...
Article
We propose a procedure associated with the idea of the E-M algorithm for model selection in the presence of missing data. The idea extends the concept of parameters to include both the model and the parameters under the model, and thus allows the model to be part of the E-M iterations. We develop the procedure, known as the E-MS algorithm, under th...
Article
Full-text available
We show that if we have an orthogonal base ($u_1,\ldots,u_p$) in a $p$-dimensional vector space, and select $p+1$ vectors $v_1,\ldots, v_p$ and $w$ such that the vectors traverse the origin, then the probability of $w$ being closer to all the vectors in the base than to $v_1,\ldots, v_p$ is at least 1/2 and converges as $p$ increases to infinity...
Article
Full-text available
The paper addresses a common problem in the analysis of high-dimensional high-throughput "omics" data, which is parameter estimation across multiple variables in a set of data where the number of variables is much larger than the sample size. Among the problems posed by this type of data are that variable-specific estimators of variances are not re...
Article
Full-text available
The question of molecular heterogeneity and of tumoral phenotype in cancer remains unresolved. To understand the underlying molecular basis of this phenomenon, we analyzed genome-wide expression data of colon cancer metastasis samples, as these tumors are the most advanced and hence would be anticipated to be the most likely heterogeneous group of...
Article
Spike and slab models are a popular and attractive variable selection approach in regression settings. Applications for these models have blossomed over the last decade and they are increasingly being used in challenging problems. At the same time, theory for spike and slab models has not kept pace with the applications. There are many gaps in what...
Article
We derive the best predictive estimator (BPE) of the fixed parameters under two well-known small area models, the Fay–Herriot model and the nested-error regression model. This leads to a new prediction procedure, called observed best prediction (OBP), which is different from the empirical best linear unbiased prediction (EBLUP). We show that BPE is...
Article
The fence method [J. Jiang et al., Ann. Stat. 36, No. 4, 1669–1692 (2008; Zbl 1142.62047)] is a recently developed strategy for model selection. The idea involves a procedure to isolate a subgroup of what are known as correct models (of which the optimal model is a member). This is accomplished by constructing a statistical fence, or barrier, to ca...
Article
Full-text available
Crooked tail (Cd) mice bear a gain-of-function mutation in Lrp6, a co-receptor for canonical WNT signaling, and are a model of neural tube defects (NTDs), preventable with dietary folic acid (FA) supplementation. Whether the FA response reflects a direct influence of FA on LRP6 function was tested with prenatal supplementation in LRP6-deficient emb...
Article
Full-text available
The search for structures in real datasets, e.g. in the form of bumps, components, classes or clusters, is important as these often reveal underlying phenomena leading to scientific discoveries. One of these tasks, known as bump hunting, is to locate domains of a multidimensional input space where the target function assumes local maxima without pre-...
Article
Full-text available
Weighted generalized ridge regression offers unique advantages in correlated high-dimensional problems. Such estimators can be efficiently computed using Bayesian spike and slab models and are effective for prediction. For sparse variable selection, a generalization of the elastic net can be used in tandem with these Bayesian estimates. In this ar...
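Generalized ridge regression with a separate penalty per coefficient can be sketched in closed form (the fixed, equal penalty weights below are illustrative; the abstract's approach obtains such weights efficiently via Bayesian spike and slab models, which this sketch does not reproduce):

```python
import numpy as np

rng = np.random.default_rng(2)
# Simulated sparse linear model (all values illustrative).
n, p = 50, 10
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [2.0, -1.5, 1.0]
y = X @ beta + rng.normal(scale=0.5, size=n)

# Generalized ridge: solve (X'X + diag(lam)) b = X'y, with one penalty
# lam_j per coefficient; ordinary ridge is the equal-weights special case.
lam = np.full(p, 5.0)
beta_ridge = np.linalg.solve(X.T @ X + np.diag(lam), X.T @ y)
```

Allowing the penalties lam_j to differ across coefficients is what makes the estimator "weighted": heavily penalized coefficients are shrunk harder, which is useful when predictors are correlated and only some carry signal.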
Article
This paper considers the problem of selecting nonparametric models for small area estimation, which has recently received much attention. We develop a procedure based on the idea of the fence method (Jiang, Rao, Gu and Nguyen 2008) for selecting the mean function for the small areas from a class of approximating splines. Simulation results show impres...
Article
Protein evolution is constrained by folding efficiency (“foldability”) and the implicit threat of toxic misfolding. A model is provided by proinsulin, whose misfolding is associated with β-cell dysfunction and diabetes mellitus. An insulin analogue containing a subtle core substitution (LeuA16 → Val) is biologically active, and its crystal structur...
Article
Full-text available
Protein evolution is constrained by folding efficiency ("foldability") and the implicit threat of toxic misfolding. A model is provided by proinsulin, whose misfolding is associated with β-cell dysfunction and diabetes mellitus. An insulin analogue containing a subtle core substitution (Leu(A16) → Val) is biologically active, and its crystal s...
Article
We used high-resolution SNP genotyping to identify regions of genomic gain and loss in the genomes of 212 medulloblastomas, malignant pediatric brain tumors. We found focal amplifications of 15 known oncogenes and focal deletions of 20 known tumor suppressor genes (TSG), most not previously implicated in medulloblastoma. Notably, we identified prev...
Article
In this short note, we propose a simplified adaptive fence procedure that reduces the computational burden of the adaptive fence procedure proposed by Jiang et al. [Jiang, J., Rao, J.S., Gu, Z., Nguyen, T., 2008. Fence methods for mixed model selection. Ann. Statist. 36, 1669-1692] for mixed model selection problems. The consistency property of th...
Article
Full-text available
Background. The computational identification of functional transcription factor binding sites (TFBSs) remains a major challenge of computational biology. Results. We have analyzed the conserved promoter sequences for the complete set of human RefSeq genes using our conserved transcription factor binding site (CONFAC) software. CONFAC identified 1...
Data
Supplementary Table S1 presents the complete dataset from the analysis given in Section 2 of the paper. Supplementary Tables S2–S8 show the complete data on conserved TFBS for each of the seven groups analyzed in Section 2.1 of the paper. Supplementary Tables S9–S16 demonstrate detailed results of the cross-validation summarized in Table 4 of the p...
Article
Full-text available
Many model search strategies involve trading off model fit with model complexity in a penalized goodness of fit measure. Asymptotic properties for these types of procedures in settings like linear regression and ARMA time series have been studied, but these do not naturally extend to nonstandard situations such as mixed effects models, where simple...
Article
Clustering of gene expression profiles is a widely used approach for finding macroscopic data structure. A complication in such analyses is that not all genes are informative for forming clusters and different clusters might have different transcription regulation. Driven by these considerations, we present a novel two-stage clustering approach. Th...
Article
We consider the properties of the highest posterior probability model in a linear regression setting. Under a spike and slab hierarchy we find that although highest posterior model selection is total risk consistent, it possesses hidden undesirable properties. One such property is a marked underfitting in finite samples, a phenomenon well noted for...
Article
We propose and assess a set of non-parametric ensembles, including bagging and boosting schemes, to recognize tumors in digital mammograms. Different approaches were examined as candidates for the two major components of the bagging ensembles: three spatial resampling schemes (residuals, centers and standardized centers), and four combination crite...
Article
This paper summarizes contributions to group 12 of the 15th Genetic Analysis Workshop. The papers in this group focused on multivariate methods and applications for the analysis of molecular data including genotypic data as well as gene expression microarray measurements and clinical phenotypes. A range of multivariate techniques have been employed...
Article
Full-text available
Standard genetic mapping techniques scan chromosomal segments for location of genetic linkage and association signals. The majority of these methods consider only correlations at single markers and/or phenotypes with explicit detailing of the genetic structure. These methods tend to be limited by their inability to consider the effect of large numb...
Article
Full-text available
In gene selection for cancer classification using microarray data, we define an eigenvalue-ratio statistic to measure a gene's contribution to the joint discriminability when this gene is included into a set of genes. Based on this eigenvalue-ratio statistic, we define a novel hypothesis testing for gene statistical redundancy and propose two gene...
Article
Full-text available
DNA microarrays open up a new horizon for studying the genetic determinants of disease. The high throughput nature of these arrays creates an enormous wealth of information, but also poses a challenge to data analysis. Inferential problems become even more pronounced as experimental designs used to collect data become more complex. An important exa...
Article
Increased DNA methylation is an epigenetic alteration that is common in human cancers and is often associated with transcriptional silencing. Aberrantly methylated DNA has also been proposed as a potential tumor marker. However, genes such as vimentin, which are transcriptionally silent in normal epithelium, have not until now been considered as ta...
Article
Variable selection in the linear regression model takes many apparent faces from both frequentist and Bayesian standpoints. In this paper we introduce a variable selection method referred to as a rescaled spike and slab model. We study the importance of prior hierarchical specifications and draw connections to frequentist generalized ridge regressi...
Article
DNA microarrays can provide insight into genetic changes that characterize different stages of a disease process. Accurate identification of these changes has significant therapeutic and diagnostic implications. Statistical analysis for multistage (multigroup) data is challenging, however. ANOVA-based extensions of two-sample Z-tests, a popular met...
Article
A simple technique is illustrated for analyzing multivariate survival data. The data situation arises when an individual records multiple survival events, or when individuals recording single survival events are grouped into clusters. Past work has focused on developing new methods to handle such data. Here, we use a connection between Poisson regr...
Article
We compared three methods of reporting maximal expiratory flow (V′maxFRC) measured in partial expiratory flow-volume curves (PEFVCs) at the point of functional residual capacity (FRC). PEFVCs were obtained with the rapid thoracoabdominal compression technique (RTC) on a total of 446 occasions in 281 HIV-negative, asymptomatic infants (4.8–28.1 mont...
Article
Oligonucleotide microarrays are amongst a set of technologies that allow for high throughput assessment of vast numbers of gene expressions. In order to evaluate gene expressions given detection limits, antibody spiking is often used, providing one with an expression curve relating antibody-treated expression and non-antibody-treated expression. Th...
Article
DNA microarrays open up a broad new horizon for investigators interested in studying the genetic determinants of disease. The high throughput nature of these arrays, where differential expression for thousands of genes can be measured simultaneously, creates an enormous wealth of information, but also poses a challenge for data analysis because of th...
Article
Full-text available
We consider the problem of selecting the fixed and random effects in a mixed linear model. Two kinds of selection problems are considered. The first is to select the fixed covariates from a set of candidate predictors when the random effects are not subject to selection; the second is to select both the fixed covariates and the random effect fact...
Article
A time-modulated frailty model is proposed for analyzing multivariate failure data. The effect of frailties, which may not be constant over time, is discussed. We assume a parametric model for the baseline hazard, but avoid the parametric assumption for the frailty distribution. The well-known connection between survival times and Poisson regressio...
Article
Traditional statistical analysis of 2 surgeons' experiences with resectable malignant melanoma during a 30-year period (November 1970-July 2000) was compared with new tree-structured recursive partitioning regression analysis. A total of 1018 consecutive patients were registered and 983 patients were evaluable. Disease-free survival (DFS) and melan...
Article
Full-text available
Mutations in dystrophin cause Duchenne muscular dystrophy (DMD), but absent dystrophin does not invariably cause necrosis in all muscles, life stages and species. Using DNA microarray, we established a molecular signature of dystrophinopathy in the mdx mouse, with evidence that secondary mechanisms are key contributors to pathogenesis. We used vari...
Conference Paper
Multidimensional regression analysis relates a target outcome Y to a vector of predictors X through a variety of possible link functions depending on the distribution of Y. The predictors may be used in a linear fashion or given a more data-driven nonparametric functional form. These variations on the modeling paradigm cover the standard linear mod...
Article
Full-text available
We prospectively studied children with and without maternally transmitted HIV-1 infection born to mothers infected with HIV-1 to determine the incidence of chronic radiographic lung changes (CRC) and to correlate these changes with clinical assessments. Between 1990 and 1997, we scored 3050 chest radiographs using a standardized form. Group I child...