Daniela Witten

University of Washington Seattle | UW · Department of Biostatistics

Publications: 64
Reads: 13,973
Citations: 11,582

Publications (64)
Preprint
In the analysis of single-cell RNA sequencing data, researchers often characterize the variation between cells by estimating a latent variable, such as cell type or pseudotime, representing some aspect of the individual cell's state. They then test each gene for association with the estimated latent variable. If the same data are used for both of t...
Article
While many methods are available to detect structural changes in a time series, few procedures are available to quantify the uncertainty of these estimates post‐detection. In this work, we fill this gap by proposing a new framework to test the null hypothesis that there is no change in mean around an estimated changepoint. We further show that it i...
Preprint
Full-text available
We consider the problem of testing for a difference in means between clusters of observations identified via k-means clustering. In this setting, classical hypothesis tests lead to an inflated Type I error rate. To overcome this problem, we take a selective inference approach. We propose a finite-sample p-value that controls the selective Type I er...
Preprint
Full-text available
The graph fused lasso -- which includes as a special case the one-dimensional fused lasso -- is widely used to reconstruct signals that are piecewise constant on a graph, meaning that nodes connected by an edge tend to have identical values. We consider testing for a difference in the means of two connected components estimated using the graph fuse...
Article
Full-text available
Brain networks are increasingly characterized at different scales, including summary statistics, community connectivity, and individual edges. While research relating brain networks to behavioral measurements has yielded many insights into brain‐phenotype relationships, common analytical approaches only consider network information at a single scal...
Preprint
Full-text available
Wide-field calcium imaging techniques allow recordings of high resolution neuronal activity across one or multiple brain regions. However, since the recordings capture light emission generated by the fluorescence of the calcium indicator, the neural activity that drives the calcium changes is masked by the calcium indicator dynamics. Here we develo...
Article
In this paper, we consider fitting a flexible and interpretable additive regression model in a data‐rich setting. We wish to avoid pre‐specifying the functional form of the conditional association between each covariate and the response, while still retaining interpretability of the fitted functions. A number of recent proposals in the literature f...
Article
In classical statistics, much thought has been put into experimental design and data collection. In the high-dimensional setting, however, experimental design has been less of a focus. In this paper, we stress the importance of collecting multiple replicates for each subject in the high-dimensional setting. We consider learning the structure of a g...
Article
We consider the problem of estimating multiple related Gaussian graphical models from a high-dimensional data set with observations belonging to distinct classes. We propose the joint graphical lasso, which borrows strength across the classes in order to estimate multiple graphical models that share certain characteristics, such as the locations or...
Book
An Introduction to Statistical Learning provides an accessible overview of the field of statistical learning, an essential toolset for making sense of the vast and complex data sets that have emerged in fields ranging from biology to finance to marketing to astrophysics in the past twenty years. This book presents some of the most important modelin...
Article
We consider the problem of performing unsupervised learning in the presence of outliers - that is, observations that do not come from the same distribution as the rest of the data. It is known that in this setting, standard approaches for unsupervised learning can yield unsatisfactory results. For instance, in the presence of severe outliers, K-mea...
Chapter
In this chapter, we describe tree-based methods for regression and classification. These involve stratifying or segmenting the predictor space into a number of simple regions. In order to make a prediction for a given observation, we typically use the mean or the mode of the training observations in the region to which it belongs. Since the set of...
Chapter
This chapter is about linear regression, a very simple approach for supervised learning. In particular, linear regression is a useful tool for predicting a quantitative response. Linear regression has been around for a long time and is the topic of innumerable textbooks. Though it may seem somewhat dull compared to some of the more modern statistic...
Chapter
Resampling methods are an indispensable tool in modern statistics. They involve repeatedly drawing samples from a training set and refitting a model of interest on each sample in order to obtain additional information about the fitted model. For example, in order to estimate the variability of a linear regression fit, we can repeatedly draw differe...
Chapter
The linear regression model discussed in Chapter 3 assumes that the response variable Y is quantitative. But in many situations, the response variable is instead qualitative. For example, eye color is qualitative, taking on values blue, brown, or green. Often qualitative variables are referred to as categorical; we will use these terms interchangea...
Chapter
So far in this book, we have mostly focused on linear models. Linear models are relatively simple to describe and implement, and have advantages over other approaches in terms of interpretation and inference. However, standard linear regression can have significant limitations in terms of predictive power. This is because the linearity assumption i...
Chapter
In the regression setting, the standard linear model
Chapter
In order to motivate our study of statistical learning, we begin with a simple example. Suppose that we are statistical consultants hired by a client to provide advice on how to improve sales of a particular product. The Advertising data set consists of the sales of that product in 200 different markets, along with advertising budgets for the produ...
Article
It has been claimed that most research findings are false, and it is known that large-scale studies involving omics data are especially prone to errors in design, execution, and analysis. The situation is alarming because taxpayer dollars fund a substantial amount of biomedical research, and because the publication of a research article that is lat...
Article
Full-text available
Background Molecular characterization of tumors has been critical for identifying important genes in cancer biology and for improving tumor classification and diagnosis. Long non-coding RNAs, as a new, relatively unstudied class of transcripts, provide a rich opportunity to identify both functional drivers and cancer-type-specific biomarkers. Howev...
Article
Following successful orthopaedic surgical procedures, implant removal is generally not necessary or recommended. However, patients with pain related to implants may benefit from this elective procedure. The foot and ankle may be more symptomatic from retained implants because of weight-bearing activities, shoe wear, and limited soft-tissue cushioni...
Article
We consider the supervised classification setting, in which the data consist of p features measured on n observations, each of which belongs to one of K classes. Linear discriminant analysis (LDA) is a classical method for this problem. However, in the high-dimensional setting where p ≫ n, LDA is not appropriate for two reasons. First, the standard...
Article
We discuss the identification of genes that are associated with an outcome in RNA sequencing and other sequence-based comparative genomic experiments. RNA-sequencing data take the form of counts, so models based on the Gaussian distribution are unsuitable. Moreover, normalization is challenging because different sequencing experiments may generate...
Article
High numbers of tumor-associated macrophages (TAMs) have been associated with poor outcome in several solid tumors. In 2 previous studies, we showed that colony stimulating factor-1 (CSF1) is secreted by leiomyosarcoma (LMS) and that the increase in macrophages and CSF1 associated proteins are markers for poor prognosis in both gynecologic and nong...
Article
We recently described two types of stromal response in breast cancer derived from gene expression studies of tenosynovial giant cell tumors and fibromatosis. The purpose of this study is to elucidate the basis of this stromal response--whether they are elicited by individual tumors or whether they represent an endogenous host reaction produced by t...
Article
Least squares multidimensional scaling (MDS) is a classical method for representing an n×n dissimilarity matrix D. One seeks a set of configuration points $z_1, \ldots, z_n \in \mathbb{R}^S$ such that D is well approximated by the Euclidean distances between the configuration points: $D_{ij} \approx \|z_i - z_j\|_2$. Suppose that in addition to D, a vector of associated binary class labels y∈{...
Article
We consider the problem of clustering observations using a potentially large set of features. One might expect that the true underlying clusters present in the data differ only with respect to a small fraction of the features, and will be missed if one clusters the observations using the full set of features. We propose a novel framework for sparse...
Data
Distribution of sequence counts for number of unique microRNAs expressed in cervical cancer tissues and normal cervices.
Data
Novel candidate microRNAs identified from human cervical cancer and normal cervices.
Data
microRNAs (miRNAs) used in the nearest shrunken centroid classifier for normal versus tumour, as well as the standardized centroids for each of those miRNAs in each class.
Data
microRNAs (miRNAs) used in the nearest shrunken centroid classifier for normal versus adenocarcinoma versus squamous cell carcinoma, as well as the standardized centroids for each of those miRNAs in each class.
Data
Supplementary description of statistical analysis.
Data
Full-text available
The number of microRNAs found to be differentially-expressed at a given false discovery rate threshold for each of the resampled data sets.
Data
Median score of microRNAs within each cluster in log linear model, and corresponding P-values.
Data
Small RNA sequences obtained from 29 pairs of human cervical cancer tissues and matched normal tissues.
Data
microRNAs (miRNAs) used in the nearest shrunken centroid classifier for adenocarcinoma versus squamous cell carcinoma, as well as the standardized centroids for each of those miRNAs in each class.
Data
Comparison of sequencing and Northern data of the 29 cervical cancer samples studied.
Data
Known and novel microRNAs expressed in human cervical cancer tissues and matched normal tissues.
Data
Full-text available
A unique small RNA downstream of the Vault transcript.
Data
The false discovery rate of all microRNAs as determined by our Poisson log-linear model.
Data
Average correlation of microRNAs within each cluster, and corresponding P-values.
Data
The correlation matrix for each microRNA cluster and its P-value.
Article
Full-text available
Ultra-high throughput sequencing technologies provide opportunities both for discovery of novel molecular species and for detailed comparisons of gene expression patterns. Small RNA populations are particularly well suited to this analysis, as many different small RNAs can be completely sequenced in a single instrument run. We prepared small RNA li...
Article
Full-text available
Malignant cutaneous melanoma is a highly aggressive form of skin cancer. Despite improvements in early melanoma diagnosis, the 5-year survival rate remains low in advanced disease. Therefore, novel biomarkers are urgently needed to devise new means of detection and treatment. In this study, we aimed to improve our understanding of microRNA (miRNA)...
Article
Full-text available
Gene expression microarrays are the most widely used technique for genome-wide expression profiling. However, microarrays do not perform well on formalin fixed paraffin embedded tissue (FFPET). Consequently, microarrays cannot be effectively utilized to perform gene expression profiling on the vast majority of archival tumor samples. To address thi...
Article
Full-text available
Purpose Angiosarcoma of the breast is a rare, malignant tumor for which little is known regarding prognostic indicators and optimal therapeutic regimens. To address this issue, we performed a retrospective analysis of breast angiosarcoma cases seen at Stanford University along with immunohistochemical analysis for markers of angiogenesis. Methods...
Article
In recent years, breakthroughs in biomedical technology have led to a wealth of data in which the number of features (for instance, genes on which expression measurements are available) exceeds the number of observations (e.g. patients). Sometimes survival outcomes are also available for those same observations. In this case, one might be intereste...
Article
Full-text available
We present a penalized matrix decomposition (PMD), a new framework for computing a rank-K approximation for a matrix. We approximate the matrix X as $\hat{X} = \sum_{k=1}^{K} d_k u_k v_k^T$, where $d_k$, $u_k$, and $v_k$ minimize the squared Frobenius norm of $X - \hat{X}$, subject to penalties on $u_k$ and $v_k$. This results in a regularized version of the singular value decomposition. Of particular interest is...
Article
The genetic programs that promote retention of self-renewing leukemia stem cells (LSCs) at the apex of cellular hierarchies in acute myeloid leukemia (AML) are not known. In a mouse model of human AML, LSCs exhibit variable frequencies that correlate with the initiating MLL oncogene and are maintained in a self-renewing state by a transcriptional s...
Article
In recent years, many methods have been developed for regression in high-dimensional settings. We propose covariance-regularized regression, a family of methods that use a shrunken estimate of the inverse covariance matrix of the features in order to achieve superior prediction. An estimate of the inverse covariance matrix is obtained by maximizing...
Article
In recent work, several authors have introduced methods for sparse canonical correlation analysis (sparse CCA). Suppose that two sets of measurements are available on the same set of observations. Sparse CCA is a method for identifying sparse linear combinations of the two sets of variables that are highly correlated with each other. It has been sh...
Article
Full-text available
Orthopaedic procedures have been reported to have the highest incidence of pain compared to other types of operations. There are limited studies in the literature that investigate postoperative pain. A prospective study of 98 patients undergoing orthopedic foot and ankle operations was undertaken to evaluate their pain experience. A Short-Form McGi...
Article
We consider the problem of testing the significance of features in high-dimensional settings. In particular, we test for differentially-expressed genes in a microarray experiment. We wish to identify genes that are associated with some type of outcome, such as survival time or cancer type. We propose a new procedure, called Lassoed Principal Compon...
Article
A beneficial mutation that has nearly but not yet fixed in a population produces a characteristic haplotype configuration, called a partial selective sweep. Whether nonadaptive processes might generate similar haplotype configurations has not been extensively explored. Here, we consider 5 population genetic data sets taken from regions flanking hig...
Article
Full-text available
This manuscript describes a novel strategy to improve HIV DNA vaccine design. Employing a new information theory based bioinformatic algorithm, we identify a set of nucleotide motifs which are common in the coding region of HIV, but are under-represented in genes that are highly expressed in the human genome. We hypothesize that these motifs contri...
Article
It has recently been suggested that differentially-expressed genes in a microarray experiment are best identified using fold-change, rather than a t-statistic, because the former results in lists of differentially-expressed genes that are more reproducible (Shi et al. 2005, Guo et al. 2006, MAQC Consortium 2006). We argue that reproducibility does...
