About
86 publications
17,151 reads
14,013 citations
Publications (86)
The analysis of single-cell RNA sequencing (scRNA-seq) data often involves fitting a latent variable model to learn a low-dimensional representation for the cells. Validating such a model poses a major challenge. If we could sequence the same set of cells twice, we could use one dataset to fit a latent variable model and the other to validate it. I...
Recent work has focused on the very common practice of prediction-based inference: that is, (i) using a pre-trained machine learning model to predict an unobserved response variable, and then (ii) conducting inference on the association between that predicted response and some covariates. As pointed out by Wang et al. [2020], applying a standard in...
Whole-chromosome aneuploidy and large segmental amplifications can have devastating effects in multicellular organisms, from developmental disorders and miscarriage to cancer. Aneuploidy in single-celled organisms such as yeast also results in proliferative defects and reduced viability. Yet, paradoxically, CNVs are routinely observed in laboratory...
Our goal is to develop a general strategy to decompose a random variable $X$ into multiple independent random variables, without sacrificing any information about unknown parameters. A recent paper showed that for some well-known natural exponential families, $X$ can be "thinned" into independent random variables $X^{(1)}, \ldots, X^{(K)}$, such th...
We propose data thinning, a new approach for splitting an observation into two or more independent parts that sum to the original observation, and that follow the same distribution as the original observation, up to a (known) scaling of a parameter. This proposal is very general, and can be applied to any observation drawn from a "convolution close...
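As a minimal sketch of the idea for one familiar case, assume the observation is a Poisson count. Drawing the first part as Binomial(x, eps) and taking the remainder as the second part gives two counts that sum to x; by the classical Poisson thinning property they are independent, with means scaled by eps and 1 - eps. The function name `poisson_thin` and the parameter `eps` are illustrative choices, not names taken from the paper.

```python
import random


def poisson_thin(x, eps=0.5, rng=None):
    """Split a Poisson count x into two parts (x1, x2) with x1 + x2 == x.

    Hypothetical helper: x1 is a Binomial(x, eps) draw and x2 is the remainder.
    """
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie strictly between 0 and 1")
    rng = rng or random.Random()
    # Binomial(x, eps) draw, built from x independent Bernoulli(eps) trials.
    x1 = sum(1 for _ in range(x) if rng.random() < eps)
    return x1, x - x1


rng = random.Random(0)
counts = [3, 0, 12, 7]  # made-up counts, for illustration only
parts = [poisson_thin(c, eps=0.5, rng=rng) for c in counts]
```

One part could then be used to fit a model and the other to validate it, which is the use case the abstracts above describe.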
In the analysis of single-cell RNA sequencing data, researchers often characterize the variation between cells by estimating a latent variable, such as cell type or pseudotime, representing some aspect of the cell’s state. They then test each gene for association with the estimated latent variable. If the same data are used for both of these steps,...
We consider testing whether a set of Gaussian variables, selected from the data, is independent of the remaining variables. We assume that this set is selected via a very simple approach that is commonly used across scientific disciplines: we select a set of variables for which the correlation with all variables outside the set falls below some thr...
Classical tests for a difference in means control the type I error rate when the groups are defined a priori. However, when the groups are instead defined via clustering, then applying a classical test yields an extremely inflated type I error rate. Notably, this problem persists even if two separate and independent data sets are used to define the...
The graph fused lasso—which includes as a special case the one-dimensional fused lasso—is widely used to reconstruct signals that are piecewise constant on a graph, meaning that nodes connected by an edge tend to have identical values. We consider testing for a difference in the means of two connected components estimated using the graph fused lass...
In the analysis of single-cell RNA sequencing data, researchers often characterize the variation between cells by estimating a latent variable, such as cell type or pseudotime, representing some aspect of the individual cell's state. They then test each gene for association with the estimated latent variable. If the same data are used for both of t...
While many methods are available to detect structural changes in a time series, few procedures are available to quantify the uncertainty of these estimates post‐detection. In this work, we fill this gap by proposing a new framework to test the null hypothesis that there is no change in mean around an estimated changepoint. We further show that it i...
We consider the problem of testing for a difference in means between clusters of observations identified via k-means clustering. In this setting, classical hypothesis tests lead to an inflated Type I error rate. To overcome this problem, we take a selective inference approach. We propose a finite-sample p-value that controls the selective Type I er...
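The inflation described above can be illustrated with a small simulation. This is a sketch under simplifying assumptions, not the selective test proposed in the paper: the data are one-dimensional standard normal draws with no true clusters, the clustering is a hand-rolled 2-means, and the "classical" test is a Welch-style statistic compared with the normal critical value 1.96. All function names are hypothetical.

```python
import math
import random
import statistics


def two_means_1d(xs, n_iter=20):
    """Lloyd's algorithm with two centers on 1-D data, started at min and max."""
    lo, hi = min(xs), max(xs)
    for _ in range(n_iter):
        a = [x for x in xs if abs(x - lo) <= abs(x - hi)]
        b = [x for x in xs if abs(x - lo) > abs(x - hi)]
        lo, hi = statistics.fmean(a), statistics.fmean(b)
    return a, b


def welch_z(a, b):
    """Difference in means divided by its estimated standard error."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.fmean(a) - statistics.fmean(b)) / se


def naive_rejection_rate(n=50, n_sim=200, seed=0):
    """Share of null data sets in which the naive two-sided test rejects at 5%."""
    rng = random.Random(seed)
    rejections, used = 0, 0
    for _ in range(n_sim):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]  # no true clusters
        a, b = two_means_1d(xs)
        if len(a) < 2 or len(b) < 2:
            continue  # a variance needs at least two points per cluster
        used += 1
        if abs(welch_z(a, b)) > 1.96:
            rejections += 1
    return rejections / used


rate = naive_rejection_rate()
```

Because the clusters are chosen to separate the data, the naive rejection rate in this setup should sit far above the nominal 5% level even though every data set is pure noise.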
We propose a sparse reduced rank Huber regression for analyzing large and complex high-dimensional data with heavy-tailed random noise. The proposed method is based on a convex relaxation of a rank- and sparsity-constrained nonconvex optimization problem, which is then solved using a block coordinate descent and an alternating direction method of m...
The graph fused lasso -- which includes as a special case the one-dimensional fused lasso -- is widely used to reconstruct signals that are piecewise constant on a graph, meaning that nodes connected by an edge tend to have identical values. We consider testing for a difference in the means of two connected components estimated using the graph fuse...
Brain networks are increasingly characterized at different scales, including summary statistics, community connectivity, and individual edges. While research relating brain networks to behavioral measurements has yielded many insights into brain‐phenotype relationships, common analytical approaches only consider network information at a single scal...
Wide-field calcium imaging techniques allow recordings of high resolution neuronal activity across one or multiple brain regions. However, since the recordings capture light emission generated by the fluorescence of the calcium indicator, the neural activity that drives the calcium changes is masked by the calcium indicator dynamics. Here we develo...
In this paper, we consider fitting a flexible and interpretable additive regression model in a data‐rich setting. We wish to avoid pre‐specifying the functional form of the conditional association between each covariate and the response, while still retaining interpretability of the fitted functions. A number of recent proposals in the literature f...
In classical statistics, much thought has been put into experimental design and data collection. In the high-dimensional setting, however, experimental design has been less of a focus. In this paper, we stress the importance of collecting multiple replicates for each subject in the high-dimensional setting. We consider learning the structure of a g...
We consider the problem of estimating multiple related Gaussian graphical models from a high-dimensional data set with observations belonging to distinct classes. We propose the joint graphical lasso, which borrows strength across the classes in order to estimate multiple graphical models that share certain characteristics, such as the locations or...
An Introduction to Statistical Learning provides an accessible overview of the field of statistical learning, an essential toolset for making sense of the vast and complex data sets that have emerged in fields ranging from biology to finance to marketing to astrophysics in the past twenty years. This book presents some of the most important modelin...
We consider the problem of performing unsupervised learning in the presence of outliers - that is, observations that do not come from the same distribution as the rest of the data. It is known that in this setting, standard approaches for unsupervised learning can yield unsatisfactory results. For instance, in the presence of severe outliers, K-mea...
In this chapter, we describe tree-based methods for regression and classification. These involve stratifying or segmenting the predictor space into a number of simple regions. In order to make a prediction for a given observation, we typically use the mean or the mode of the training observations in the region to which it belongs. Since the set of...
This chapter is about linear regression, a very simple approach for supervised learning. In particular, linear regression is a useful tool for predicting a quantitative response. Linear regression has been around for a long time and is the topic of innumerable textbooks. Though it may seem somewhat dull compared to some of the more modern statistic...
Resampling methods are an indispensable tool in modern statistics. They involve repeatedly drawing samples from a training set and refitting a model of interest on each sample in order to obtain additional information about the fitted model. For example, in order to estimate the variability of a linear regression fit, we can repeatedly draw differe...
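The resampling idea can be sketched for the variability of a fitted regression slope. The example below is an illustration on simulated data, assuming a simple linear regression and a nonparametric bootstrap over (x, y) pairs; the helper names `fit_slope` and `bootstrap_slope_se` are hypothetical and do not come from the book.

```python
import random
import statistics


def fit_slope(xs, ys):
    """Least squares slope of y on x (simple linear regression)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


def bootstrap_slope_se(xs, ys, n_boot=500, seed=0):
    """Standard deviation of the slope across bootstrap resamples."""
    rng = random.Random(seed)
    n = len(xs)
    slopes = []
    for _ in range(n_boot):
        # Draw n observations with replacement and refit the model.
        idx = [rng.randrange(n) for _ in range(n)]
        bx = [xs[i] for i in idx]
        by = [ys[i] for i in idx]
        if len(set(bx)) < 2:
            continue  # slope is undefined when every resampled x is identical
        slopes.append(fit_slope(bx, by))
    return statistics.stdev(slopes)


# Simulated data: true intercept 1, true slope 2, unit-variance Gaussian noise.
rng = random.Random(1)
xs = [float(i) for i in range(20)]
ys = [1.0 + 2.0 * x + rng.gauss(0.0, 1.0) for x in xs]
se = bootstrap_slope_se(xs, ys)
```

The spread of the refitted slopes serves as an estimate of the standard error of the original fit.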
The linear regression model discussed in Chapter 3 assumes that the response variable Y is quantitative. But in many situations, the response variable is instead qualitative. For example, eye color is qualitative, taking on values blue, brown, or green. Often qualitative variables are referred to as categorical; we will use these terms interchangea...
So far in this book, we have mostly focused on linear models. Linear models are relatively simple to describe and implement, and have advantages over other approaches in terms of interpretation and inference. However, standard linear regression can have significant limitations in terms of predictive power. This is because the linearity assumption i...
In the regression setting, the standard linear model
In order to motivate our study of statistical learning, we begin with a simple example. Suppose that we are statistical consultants hired by a client to provide advice on how to improve sales of a particular product. The Advertising data set consists of the sales of that product in 200 different markets, along with advertising budgets for the produ...
It has been claimed that most research findings are false, and it is known that large-scale studies involving omics data are especially prone to errors in design, execution, and analysis. The situation is alarming because taxpayer dollars fund a substantial amount of biomedical research, and because the publication of a research article that is lat...
Background: Molecular characterization of tumors has been critical for identifying important genes in cancer biology and for improving tumor classification and diagnosis. Long non-coding RNAs, as a new, relatively unstudied class of transcripts, provide a rich opportunity to identify both functional drivers and cancer-type-specific biomarkers. Howev...
Following successful orthopaedic surgical procedures, implant removal is generally not necessary or recommended. However, patients with pain related to implants may benefit from this elective procedure. The foot and ankle may be more symptomatic from retained implants because of weight-bearing activities, shoe wear, and limited soft-tissue cushioni...
We consider the supervised classification setting, in which the data consist of p features measured on n observations, each of which belongs to one of K classes. Linear discriminant analysis (LDA) is a classical method for this problem. However, in the high-dimensional setting where p ≫ n, LDA is not appropriate for two reasons. First, the standard...
We discuss the identification of genes that are associated with an outcome in RNA sequencing and other sequence-based comparative genomic experiments. RNA-sequencing data take the form of counts, so models based on the Gaussian distribution are unsuitable. Moreover, normalization is challenging because different sequencing experiments may generate...
High numbers of tumor-associated macrophages (TAMs) have been associated with poor outcome in several solid tumors. In 2 previous studies, we showed that colony stimulating factor-1 (CSF1) is secreted by leiomyosarcoma (LMS) and that the increase in macrophages and CSF1 associated proteins are markers for poor prognosis in both gynecologic and nong...
We recently described two types of stromal response in breast cancer derived from gene expression studies of tenosynovial giant cell tumors and fibromatosis. The purpose of this study is to elucidate the basis of these stromal responses: whether they are elicited by individual tumors or whether they represent an endogenous host reaction produced by t...
Least squares multidimensional scaling (MDS) is a classical method for representing an $n \times n$ dissimilarity matrix $D$. One seeks a set of configuration points $z_1, \ldots, z_n \in \mathbb{R}^S$ such that $D$ is well approximated by the Euclidean distances between the configuration points: $D_{ij} \approx \|z_i - z_j\|_2$. Suppose that in addition to $D$, a vector of associated binary class labels y∈{...
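The least squares criterion behind classical MDS can be written down directly. The sketch below assumes the common form of the objective, the sum over pairs i < j of the squared gap between D_ij and the Euclidean distance between z_i and z_j; it covers only the unsupervised criterion, not the class-label extension that the truncated abstract goes on to introduce. The function name `mds_stress` is a hypothetical choice.

```python
import math


def mds_stress(D, Z):
    """Sum over i < j of (D[i][j] - ||Z[i] - Z[j]||_2) ** 2."""
    n = len(D)
    return sum(
        (D[i][j] - math.dist(Z[i], Z[j])) ** 2
        for i in range(n)
        for j in range(i + 1, n)
    )


# Toy dissimilarity matrix and two candidate configurations in the plane.
D = [[0.0, 5.0], [5.0, 0.0]]
exact = mds_stress(D, [(0.0, 0.0), (3.0, 4.0)])      # distances match D
collapsed = mds_stress(D, [(0.0, 0.0), (0.0, 0.0)])  # both points coincide
```

A configuration that reproduces the dissimilarities exactly has zero stress, while collapsing the points incurs the full squared dissimilarity as a penalty.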
We consider the problem of clustering observations using a potentially large set of features. One might expect that the true underlying clusters present in the data differ only with respect to a small fraction of the features, and will be missed if one clusters the observations using the full set of features. We propose a novel framework for sparse...
Distribution of sequence counts for number of unique microRNAs expressed in cervical cancer tissues and normal cervices.
Novel candidate microRNAs identified from human cervical cancer and normal cervices.
microRNAs (miRNAs) used in the nearest shrunken centroid classifier for normal versus tumour, as well as the standardized centroids for each of those miRNAs in each class.
microRNAs (miRNAs) used in the nearest shrunken centroid classifier for normal versus adenocarcinoma versus squamous cell carcinoma, as well as the standardized centroids for each of those miRNAs in each class.
Supplementary description of statistical analysis.
The number of microRNAs found to be differentially-expressed at a given false discovery rate threshold for each of the resampled data sets.
Median score of microRNAs within each cluster in log linear model, and corresponding P-values.
Small RNA sequences obtained from 29 pairs of human cervical cancer tissues and matched normal tissues.
microRNAs (miRNAs) used in the nearest shrunken centroid classifier for adenocarcinoma versus squamous cell carcinoma, as well as the standardized centroids for each of those miRNAs in each class.
Comparison of sequencing and Northern data of the 29 cervical cancer samples studied.
Known and novel microRNAs expressed in human cervical cancer tissues and matched normal tissues.
A unique small RNA downstream of the Vault transcript.
The false discovery rate of all microRNAs as determined by our Poisson log-linear model.
Average correlation of microRNAs within each cluster, and corresponding P-values.
The correlation matrix for each microRNA cluster and its P-value.
Ultra-high throughput sequencing technologies provide opportunities both for discovery of novel molecular species and for detailed comparisons of gene expression patterns. Small RNA populations are particularly well suited to this analysis, as many different small RNAs can be completely sequenced in a single instrument run. We prepared small RNA li...
Malignant cutaneous melanoma is a highly aggressive form of skin cancer. Despite improvements in early melanoma diagnosis, the 5-year survival rate remains low in advanced disease. Therefore, novel biomarkers are urgently needed to devise new means of detection and treatment. In this study, we aimed to improve our understanding of microRNA (miRNA)...
Gene expression microarrays are the most widely used technique for genome-wide expression profiling. However, microarrays do not perform well on formalin fixed paraffin embedded tissue (FFPET). Consequently, microarrays cannot be effectively utilized to perform gene expression profiling on the vast majority of archival tumor samples. To address thi...
Purpose: Angiosarcoma of the breast is a rare, malignant tumor for which little is known regarding prognostic indicators and optimal therapeutic regimens. To address this issue, we performed a retrospective analysis of breast angiosarcoma cases seen at Stanford University along with immunohistochemical analysis for markers of angiogenesis. Methods...
In recent years, breakthroughs in biomedical technology have led to a wealth of data in which the number of features (for instance, genes on which expression measurements are available) exceeds the number of observations (e.g. patients). Sometimes survival outcomes are also available for those same observations. In this case, one might be intereste...
We present a penalized matrix decomposition (PMD), a new framework for computing a rank-$K$ approximation for a matrix. We approximate the matrix $X$ as $\hat{X} = \sum_{k=1}^{K} d_k u_k v_k^T$, where $d_k$, $u_k$, and $v_k$ minimize the squared Frobenius norm of $X - \hat{X}$, subject to penalties on $u_k$ and $v_k$. This results in a regularized version of the singular value decomposition. Of particular interest is...
The genetic programs that promote retention of self-renewing leukemia stem cells (LSCs) at the apex of cellular hierarchies in acute myeloid leukemia (AML) are not known. In a mouse model of human AML, LSCs exhibit variable frequencies that correlate with the initiating MLL oncogene and are maintained in a self-renewing state by a transcriptional s...
In recent years, many methods have been developed for regression in high-dimensional settings. We propose covariance-regularized regression, a family of methods that use a shrunken estimate of the inverse covariance matrix of the features in order to achieve superior prediction. An estimate of the inverse covariance matrix is obtained by maximizing...
In recent work, several authors have introduced methods for sparse canonical correlation analysis (sparse CCA). Suppose that two sets of measurements are available on the same set of observations. Sparse CCA is a method for identifying sparse linear combinations of the two sets of variables that are highly correlated with each other. It has been sh...
Orthopaedic procedures have been reported to have the highest incidence of pain compared to other types of operations. There are limited studies in the literature that investigate postoperative pain. A prospective study of 98 patients undergoing orthopedic foot and ankle operations was undertaken to evaluate their pain experience. A Short-Form McGi...
We consider the problem of testing the significance of features in high-dimensional settings. In particular, we test for differentially-expressed genes in a microarray experiment. We wish to identify genes that are associated with some type of outcome, such as survival time or cancer type. We propose a new procedure, called Lassoed Principal Compon...
A beneficial mutation that has nearly but not yet fixed in a population produces a characteristic haplotype configuration, called a partial selective sweep. Whether nonadaptive processes might generate similar haplotype configurations has not been extensively explored. Here, we consider 5 population genetic data sets taken from regions flanking hig...
This manuscript describes a novel strategy to improve HIV DNA vaccine design. Employing a new information theory based bioinformatic algorithm, we identify a set of nucleotide motifs which are common in the coding region of HIV, but are under-represented in genes that are highly expressed in the human genome. We hypothesize that these motifs contri...
It has recently been suggested that differentially-expressed genes in a microarray experiment are best identified using fold-change, rather than a t-statistic, because the former results in lists of differentially-expressed genes that are more reproducible (Shi et al. 2005, Guo et al. 2006, MAQC Consortium 2006). We argue that reproducibility does...