Robert Tibshirani
Stanford University | SU · Department of Statistics

About

594
Publications
152,766
Reads
199,985
Citations

Publications (594)
Article
We propose a method for supervised learning with multiple sets of features (“views”). The multiview problem is especially important in biology and medicine, where “-omics” data, such as genomics, proteomics, and radiomics, are measured on a common set of samples. “Cooperative learning” combines the usual squared-error loss of predictions with an “a...
Article
In high-dimensional regression problems, often a relatively small subset of the features are relevant for predicting the outcome, and methods that impose sparsity on the solution are popular. When multiple correlated outcomes are available (multitask), reduced rank regression is an effective way to borrow strength and capture latent structures that...
Preprint
Clinical diagnoses rely on a wide variety of laboratory tests and imaging studies, interpreted alongside physical examination and documentation of symptoms and patient history. However, the tools of diagnosis make little use of the immune system’s internal record of specific disease exposures encoded by the antigen-specific receptors of memory B ce...
Article
Full-text available
We present a systematic assessment of polygenic risk score (PRS) prediction across more than 1,500 traits using genetic and phenotype data in the UK Biobank. We report 813 sparse PRS models with significant (p < 2.5 × 10⁻⁵) incremental predictive performance when compared against the covariate-only model that considers age, sex, types of genotypi...
Preprint
Full-text available
Forecasting methodologies have always attracted a lot of attention and have become an especially hot topic since the beginning of the COVID-19 pandemic. In this paper we consider the problem of multi-period forecasting that aims to predict several horizons at once. We propose a novel approach that forces the prediction to be "smooth" across horizon...
Article
Full-text available
We consider the multi‐class classification problem when the training data and the out‐of‐sample test data may have different distributions and propose a method called BCOPS (balanced and conformal optimized prediction sets). BCOPS constructs a prediction set C(x) as a subset of class labels, possibly empty. It tries to optimize the out‐of‐sample pe...
Preprint
Full-text available
Standard regression methods can lead to inconsistent estimates of causal effects when there are time-varying treatment effects and time-varying covariates. This is due to certain non-confounding latent variables that create colliders in the causal graph. These latent variables, which we call phantoms, do not harm the identifiability of the causal e...
Article
Full-text available
The outbreak of COVID-19 has created an unprecedented global crisis. While the polymerase chain reaction (PCR) is the gold standard method for detecting active SARS-CoV-2 infection, alternative high-throughput diagnostic tests are of significant value to meet universal testing demands. Here, we describe a new design of the MasSpec Pen technology in...
Preprint
Full-text available
We present a systematic assessment of polygenic risk score (PRS) prediction across more than 1,500 traits using genetic and phenotype data in the UK Biobank. We report 813 sparse PRS models with significant (p < 2.5 × 10⁻⁵) incremental predictive performance when compared against the covariate-only model that considers age, sex, types of genotypi...
Article
Full-text available
High‐dimensional data are becoming increasingly common in the medical field as large volumes of patient information are collected and processed by high‐throughput screening, electronic health records, and comprehensive genomic testing. Statistical models that attempt to study the effects of many predictors on survival typically implement feature se...
Preprint
The COVID-19 pandemic presented enormous data challenges in the United States. Policy makers, epidemiological modelers, and health researchers all require up-to-date data on the pandemic and relevant public behavior, ideally at fine spatial and temporal resolution. The COVIDcast API is our attempt to fill this need: operational since April 2020, it...
Article
Background While multiparametric MRI (mpMRI) has high sensitivity for detection of clinically significant prostate cancer (CSC), false positives and negatives remain common. Calculators that combine mpMRI with clinical variables can improve cancer risk assessment, while providing more accurate predictions for individual patients. We sought to creat...
Chapter
In the regression setting, the standard linear model Y = β₀ + β₁X₁ + ⋯ + βₚXₚ + ϵ is commonly used to describe the relationship between a response Y and a set of variables X₁, X₂, …, Xₚ.
Chapter
In this chapter, we discuss the support vector machine (SVM), an approach for classification that was developed in the computer science community in the 1990s and that has grown in popularity since then.
Chapter
So far in this book, we have mostly focused on linear models. Linear models are relatively simple to describe and implement, and have advantages over other approaches in terms of interpretation and inference.
Chapter
The linear regression model discussed in Chap. 3 assumes that the response variable Y is quantitative. But in many situations, the response variable is instead qualitative.
Chapter
Resampling methods are an indispensable tool in modern statistics. They involve repeatedly drawing samples from a training set and refitting a model of interest on each sample in order to obtain additional information about the fitted model.
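As a rough illustration of the resampling idea described in this summary, the sketch below uses the bootstrap to estimate the standard error of a sample mean. It is a minimal example under stated assumptions, not code from the chapter: the function name `bootstrap_se`, the toy data, and the choice of 1,000 resamples are all hypothetical.

```python
import random
import statistics


def bootstrap_se(sample, stat=statistics.mean, n_boot=1000, seed=0):
    """Estimate the standard error of `stat` by bootstrap resampling.

    Hypothetical helper for illustration: draws `n_boot` resamples of the
    same size as `sample`, with replacement, recomputes the statistic on
    each resample, and returns the standard deviation of those values.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(sample)
    estimates = []
    for _ in range(n_boot):
        # Draw n observations with replacement from the original sample.
        resample = [sample[rng.randrange(n)] for _ in range(n)]
        estimates.append(stat(resample))
    return statistics.stdev(estimates)


# Toy data (made up for the example).
data = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7]
se = bootstrap_se(data)
print(f"bootstrap standard error of the mean: {se:.3f}")
```

The same loop works for any statistic passed as `stat`, for example `statistics.median`, which is where resampling tends to be most useful because no simple closed-form standard error is available.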
Chapter
In this chapter, we will consider the topics of survival analysis and censored data. These arise in the analysis of a unique kind of outcome variable: the time until an event occurs.
Chapter
Thus far, this textbook has mostly focused on estimation and its close cousin, prediction. In this chapter, we instead focus on hypothesis testing, which is key to conducting inference. We remind the reader that inference was briefly discussed in Chapter 2.
Chapter
Most of this book concerns supervised learning methods such as regression and classification. In the supervised learning setting, we typically have access to a set of p features X₁, X₂, …, Xₚ, measured on n observations, and a response Y also measured on those same n observations. The goal is then to predict Y using X₁, X₂, …, Xₚ.
Chapter
In this chapter, we describe tree-based methods for regression and classification. These involve stratifying or segmenting the predictor space into a number of simple regions. In order to make a prediction for a given observation, we typically use the mean or the mode response value for the training observations in the region to which it belongs.
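The region-based prediction described in this summary can be sketched with the smallest possible tree: a single split on one predictor, with each of the two regions predicting the mean response of its training observations. This is an illustrative sketch only, not the chapter's algorithm in full; the names `fit_stump` and `predict` and the toy data are hypothetical.

```python
from statistics import mean


def fit_stump(xs, ys):
    """Fit a one-split regression tree on a single predictor.

    Tries every midpoint between consecutive distinct x values and keeps
    the threshold that minimizes the total squared error, where each
    region predicts the mean response of its training observations.
    Returns (threshold, left_mean, right_mean).
    """
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no valid split between identical x values
        thr = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for x, y in pairs if x < thr]
        right = [y for x, y in pairs if x >= thr]
        left_mean, right_mean = mean(left), mean(right)
        sse = sum((y - left_mean) ** 2 for y in left)
        sse += sum((y - right_mean) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, thr, left_mean, right_mean)
    if best is None:
        raise ValueError("need at least two distinct x values")
    return best[1:]


def predict(stump, x):
    """Predict with the mean response of the region that x falls into."""
    thr, left_mean, right_mean = stump
    return left_mean if x < thr else right_mean


# Toy data (made up): two well-separated clusters.
xs = [1, 2, 3, 10, 11, 12]
ys = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
stump = fit_stump(xs, ys)
print(stump[0], predict(stump, 2), predict(stump, 11))
```

A full regression tree applies the same split search recursively within each region; for classification, the region's most common class (the mode) replaces the mean.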
Chapter
In order to motivate our study of statistical learning, we begin with a simple example. Suppose that we are statistical consultants hired by a client to investigate the association between advertising and sales of a particular product.
Chapter
This chapter is about linear regression, a very simple approach for supervised learning. In particular, linear regression is a useful tool for predicting a quantitative response. It has been around for a long time and is the topic of innumerable textbooks. Though it may seem somewhat dull compared to some of the more modern statistical learning app...
Chapter
This chapter covers the important topic of deep learning. At the time of writing (2020), deep learning is a very active area of research in the machine learning and artificial intelligence communities.
Article
Full-text available
Cancer is the result of mutagenic processes that can be inferred from tumor genomes by analyzing rate spectra of point mutations, or “mutational signatures”. Here we present SparseSignatures, a novel framework to extract signatures from somatic point mutation data. Our approach incorporates a user-specified background signature, employs regularizat...
Preprint
Full-text available
Reliable, short-term forecasts of traditional public health reporting streams (such as cases, hospitalizations, and deaths) are a key ingredient in effective public health decision-making during a pandemic. Since April 2020, our research group has worked with data partners to collect, curate, and make publicly available numerous real-time COVID-19...
Preprint
Full-text available
Using evidence derived from previously collected medical records to guide patient care has been a long standing vision of clinicians and informaticians, and one with the potential to transform medical practice. As a result of advances in technical infrastructure, statistical analysis methods, and the availability of patient data at scale, an implem...
Article
Motivation Large-scale and high-dimensional genome sequencing data poses computational challenges. General purpose optimization tools are usually not optimal in terms of computational and memory performance for genetic data. Results We develop two efficient solvers for optimization problems arising from large-scale regularized regressions on milli...
Preprint
Full-text available
The outbreak of COVID-19 has created an unprecedented global crisis. While PCR is the gold standard method for detecting active SARS-CoV-2 infection, alternative high-throughput diagnostic tests are of significant value to meet universal testing demands. Here, we describe a new design of the MasSpec Pen technology integrated to electrospray ionizatio...
Article
We study the assessment of the accuracy of heterogeneous treatment effect (HTE) estimation, where the HTE is not directly observable so standard computation of prediction errors is not applicable. To tackle the difficulty, we propose an assessment approach by constructing pseudo‐observations of the HTE based on matching. Our contributions are three...
Article
We propose a new method for supervised learning, the “principal components lasso” (“pcLasso”). It combines the lasso (ℓ1) penalty with a quadratic penalty that shrinks the coefficient vector toward the feature matrix's leading principal components (PCs). pcLasso can be especially powerful if the features are preassigned to groups. In that case,...
Preprint
Full-text available
High-dimensional data are becoming increasingly common in the medical field as large volumes of patient information are collected and processed by high-throughput screening, electronic health records (EHRs), and comprehensive genomic testing. Statistical models that attempt to study the effects of many predictors on survival typically implement fea...
Article
Motivation: The prediction performance of the Cox proportional hazard model suffers when there are only a few uncensored events in the training data. Results: We propose a Sparse-Group regularized Cox regression method to improve the prediction performance of large-scale and high-dimensional survival data with few observed events. Our approach is appl...
Article
Full-text available
Clinical laboratory tests are a critical component of the continuum of care. We evaluate the genetic basis of 35 blood and urine laboratory measurements in the UK Biobank (n = 363,228 individuals). We identify 1,857 loci associated with at least one trait, containing 3,374 fine-mapped associations and additional sets of large-effect (>0.1 s.d.) pro...
Article
An Introduction to Statistical Learning provides an accessible overview of the field of statistical learning, an essential toolset for making sense of the vast and complex data sets that have emerged in fields ranging from biology to finance to marketing to astrophysics in the past twenty years. This book presents some of the most important modelin...
Article
Full-text available
SARS-CoV-2-specific antibodies, particularly those preventing viral spike receptor binding domain (RBD) interaction with host angiotensin-converting enzyme 2 (ACE2) receptor, can neutralize the virus. It is, however, unknown which features of the serological response may affect clinical outcomes of COVID-19 patients. We analyzed 983 longitudinal pl...
Article
Professor Efron has presented us with a thought‐provoking paper on the relationship between prediction, estimation, and attribution in the modern era of data science. While we appreciate many of his arguments, we see more of a continuum between the old and new methodology, and the opportunity for both to improve through their synergy.
Article
Sparse generalised additive models (GAMs) are an extension of sparse generalised linear models that allow a model's prediction to vary non‐linearly with an input variable. This enables the data analyst to build more accurate models, especially when the linearity assumption is known to be a poor approximation of reality. Motivated by reluctant interact...
Article
Full-text available
The UK Biobank is a very large, prospective population-based cohort study across the United Kingdom. It provides unprecedented opportunities for researchers to investigate the relationship between genotypic information and phenotypes of interest. Multiple regression methods, compared with genome-wide association studies (GWAS), have already been sh...
Article
Full-text available
The dense network of interconnected cellular signalling responses that are quantifiable in peripheral immune cells provides a wealth of actionable immunological insights. Although high-throughput single-cell profiling techniques, including polychromatic flow and mass cytometry, have matured to a point that enables detailed immune profiling of patie...
Article
We develop a scalable and highly efficient algorithm to fit a Cox proportional hazard model by maximizing the $L^1$-regularized (Lasso) partial likelihood function, based on the Batch Screening Iterative Lasso (BASIL) method developed in Qian and others (2019). Our algorithm is particularly suitable for large-scale and high-dimensional data that do...
Preprint
Full-text available
SARS-CoV-2-specific antibodies, particularly those preventing viral spike receptor binding domain (RBD) interaction with host angiotensin-converting enzyme 2 (ACE2) receptor, could offer protective immunity, and may affect clinical outcomes of COVID-19 patients. We analyzed 625 serial plasma samples from 40 hospitalized COVID-19 patients and 170 SA...
Article
Oral immunotherapy (OIT) can successfully desensitize allergic individuals to offending foods such as peanut. Our recent clinical trial (NCT02103270) of peanut OIT allowed us to monitor peanut-specific CD4+ T cells, using MHC-peptide Dextramers, over the course of OIT. We used a single-cell targeted RNAseq assay to analyze these cells at 0, 12, 24,...
Preprint
Full-text available
In some supervised learning settings, the practitioner might have additional information on the features used for prediction. We propose a new method which leverages this additional information for better prediction. The method, which we call the feature-weighted elastic net ("fwelnet"), uses these "features of features" to adapt the relative penal...
Article
Full-text available
Metabolism during pregnancy is a dynamic and precisely programmed process, the failure of which can bring devastating consequences to the mother and fetus. To define a high-resolution temporal profile of metabolites during healthy pregnancy, we analyzed the untargeted metabolome of 784 weekly blood samples from 30 pregnant women. Broad changes and...
Preprint
Full-text available
In high-dimensional regression problems, often a relatively small subset of the features are relevant for predicting the outcome, and methods that impose sparsity on the solution are popular. When multiple correlated outcomes are available (multitask), reduced rank regression is an effective way to borrow strength and capture latent structures that...
Article
Professor Efron has presented us with a thought-provoking paper on the relationship between prediction, estimation, and attribution in the modern era of data science. While we appreciate many of his arguments, we see more of a continuum between the old and new methodology, and the opportunity for both to improve through their synergy.
Article
Full-text available
Radiologic screening of high-risk adults reduces lung-cancer-related mortality [1,2]; however, a small minority of eligible individuals undergo such screening in the United States [3,4]. The availability of blood-based tests could increase screening uptake. Here we introduce improvements to cancer personalized profiling by deep sequencing (CAPP-Seq) [5], a...
Article
B cells in human food allergy have been studied predominantly in the blood. Little is known about IgE⁺ B cells or plasma cells in tissues exposed to dietary antigens. We characterized IgE⁺ clones in blood, stomach, duodenum, and esophagus of 19 peanut-allergic patients, using high-throughput DNA sequencing. IgE⁺ cells in allergic patients are en...
Article
We propose a simple method for evaluating the model that has been chosen by an adaptive regression procedure, our main focus being the lasso. This procedure deletes each chosen predictor and refits the lasso to get a set of models that are "close" to the chosen "base model," and compares the error rate of the base model with those of nearby models....
Article
Full-text available
Elucidating the spectrum of epithelial-mesenchymal transition (EMT) and mesenchymal-epithelial transition (MET) states in clinical samples promises insights on cancer progression and drug resistance. Using mass cytometry time-course analysis, we resolve lung cancer EMT states through TGFβ-treatment and identify, through TGFβ-withdrawal, a distinct...
Preprint
Full-text available
Sparse generalized additive models (GAMs) are an extension of sparse generalized linear models which allow a model's prediction to vary non-linearly with an input variable. This enables the data analyst to build more accurate models, especially when the linearity assumption is known to be a poor approximation of reality. Motivated by reluctant interac...
Article
Clear cell renal cell carcinoma (ccRCC) is the most common and lethal subtype of kidney cancer. Intraoperative frozen section (IFS) analysis is used to confirm the diagnosis during partial nephrectomy (PN). However, surgical margin evaluation using IFS analysis is time consuming and unreliable, leading to relatively low utilization. In this study,...
Article
Full-text available
Thyroid neoplasia is common and requires appropriate clinical workup with imaging and fine-needle aspiration (FNA) biopsy to evaluate for cancer. Yet, up to 20% of thyroid nodule FNA biopsies will be indeterminate in diagnosis based on cytological evaluation. Genomic approaches to characterize the malignant potential of nodules showed initial promi...
Article
Full-text available
Background: The role of epithelial-mesenchymal transition (EMT) in NSCLC is well reported and has been shown to prime cells for metastasis. EMT can be adopted or reversed (i.e. mesenchymal-epithelial transition, MET) by cells, revealing plasticity that can also lead to drug resistance. Although it is appreciated that EMT is not a binary process of...
Article
Background: Dietary avoidance is recommended for peanut allergies. We evaluated the sustained effects of peanut allergy oral immunotherapy (OIT) in a randomised long-term study in adults and children. Methods: In this randomised, double-blind, placebo-controlled, phase 2 study, we enrolled participants at the Sean N Parker Center for Allergy and...
Article
Accurate prediction of long-term outcomes remains a challenge in the care of cancer patients. Due to the difficulty of serial tumor sampling, previous prediction tools have focused on pretreatment factors. However, emerging non-invasive diagnostics have increased opportunities for serial tumor assessments. We describe the Continuous Individualized...
Article
Full-text available
Benign prostatic hyperplasia (BPH) is the most common cause of lower urinary tract symptoms in men. Current treatments target prostate physiology rather than BPH pathophysiology and are only partially effective. Here, we applied next-generation sequencing to gain new insight into BPH. By RNAseq, we uncovered transcriptional heterogeneity among BPH...