Trevor Hastie

Stanford University · Department of Statistics

Ph.D.

About

495 Publications
248,527 Reads
284,333 Citations


Publications (495)
Article
Full-text available
This randomized clinical trial evaluated the effectiveness of short, digital interventions in improving physical activity and pain for individuals with knee osteoarthritis. We compared a digital mindset intervention, focusing on adaptive mindsets (e.g., osteoarthritis is manageable), to a digital education intervention and a no-intervention group....
Article
Importance: Mental illnesses are a leading cause of disability globally, and functional disability is often in part caused by cognitive impairments across psychiatric disorders. However, studies have consistently reported seemingly opposite findings regarding the association between cognition and psychiatric symptoms. Objective: To determine if the...
Article
The increasing availability and scale of biobanks and “omic” datasets bring new horizons for understanding biological mechanisms. PathGPS is an exploratory data analysis tool to discover genetic architectures using Genome Wide Association Studies (GWAS) summary data. PathGPS is based on a linear structural equation model where traits are regulated...
Article
Full-text available
Background: Standard pediatric growth curves cannot be used to impute missing height or weight measurements in individual children. The Michaelis–Menten equation, used for characterizing substrate-enzyme saturation curves, has been shown to model growth in many organisms including nonhuman vertebrates. We investigated whether this equation could be...
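The Michaelis–Menten form y = Vmax·t/(Km + t) lends itself to a quick numeric sketch. The snippet below fits it to made-up measurements via the classical Lineweaver–Burk linearization; the parameter values are hypothetical and are not drawn from the study:

```python
import numpy as np

# Toy sketch of the Michaelis-Menten growth model y = Vmax * t / (Km + t),
# fit via the Lineweaver-Burk linearization 1/y = (Km/Vmax) * (1/t) + 1/Vmax.
# Parameter values are made up for illustration; not the study's data.
Vmax, Km = 95.0, 6.0                        # hypothetical asymptote and half-time
t = np.array([1.0, 2, 4, 8, 12, 18, 24, 36])
y = Vmax * t / (Km + t)                     # noiseless curve for clarity

# Linear regression of 1/y on 1/t recovers both parameters.
A = np.column_stack([1.0 / t, np.ones_like(t)])
slope, intercept = np.linalg.lstsq(A, 1.0 / y, rcond=None)[0]
Vmax_hat = 1.0 / intercept
Km_hat = slope * Vmax_hat
print(round(Vmax_hat, 1), round(Km_hat, 1))   # → 95.0 6.0
```

With noisy data one would fit the nonlinear form directly, but the linearization shows why a two-parameter saturating curve is so easy to estimate from sparse measurements.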
Article
Importance: Although oral temperature is commonly assessed in medical examinations, the range of usual or "normal" temperature is poorly defined. Objective: To determine normal oral temperature ranges by age, sex, height, weight, and time of day. Design, setting, and participants: This cross-sectional study used clinical visit information from...
Preprint
Full-text available
Mental illnesses are a leading cause of disability globally. Across 17 psychiatric disorders, functional disability is often in part caused by cognitive impairments. However, cognitive heterogeneity in mental health is poorly understood, particularly in children. We used generalized additive models (GAMs) to reconcile discrepant reports of cognitiv...
Article
Cross-validation is a widely-used technique to estimate prediction error, but its behavior is complex and not fully understood. Ideally, one would like to think that cross-validation estimates the prediction error for the model at hand, fit to the training data. We prove that this is not the case for the linear model fit by ordinary least squares;...
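A minimal K-fold cross-validation loop for OLS, on simulated data, illustrates the estimator whose target the paper analyzes (toy setup, not the paper's experiments):

```python
import numpy as np

# Minimal K-fold cross-validation for OLS on simulated data.
rng = np.random.default_rng(1)
n, p, K = 100, 5, 5
X = rng.normal(size=(n, p))
beta = np.arange(1.0, p + 1)
y = X @ beta + rng.normal(size=n)           # noise variance 1

folds = np.array_split(rng.permutation(n), K)
errs = []
for k in range(K):
    test = folds[k]
    train = np.concatenate([folds[j] for j in range(K) if j != k])
    bhat, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
    errs.append(np.mean((y[test] - X[test] @ bhat) ** 2))
cv_err = float(np.mean(errs))
print(round(cv_err, 2))                     # hovers near the noise variance of 1
```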
Article
Full-text available
The lasso and elastic net are popular regularized regression models for supervised learning. Friedman, Hastie, and Tibshirani (2010) introduced a computationally efficient algorithm for computing the elastic net regularization path for ordinary least squares regression, logistic regression and multinomial logistic regression, while Simon, Friedman,...
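The coordinate-descent idea behind that algorithm can be sketched in a few lines: cyclically soft-threshold each coefficient against its partial residual. This is a toy lasso solver on simulated data, not the glmnet implementation:

```python
import numpy as np

# Toy cyclic coordinate descent for the lasso objective
# (1/2n)||y - Xb||^2 + lam * ||b||_1  (a sketch, not the glmnet package).
def soft_threshold(z, g):
    return np.sign(z) * max(abs(z) - g, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]      # partial residual
            z = X[:, j] @ r / n
            beta[j] = soft_threshold(z, lam) / (X[:, j] @ X[:, j] / n)
    return beta

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 10))
beta_true = np.zeros(10)
beta_true[:3] = [3.0, -2.0, 1.5]
y = X @ beta_true + 0.1 * rng.normal(size=200)
bhat = lasso_cd(X, y, lam=0.1)
print(np.flatnonzero(np.abs(bhat) > 1e-8))   # indices of the sparse support
```

Elastic net only changes the update's denominator; the pathwise strategy of the paper then runs this solver over a decreasing grid of lam values with warm starts.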
Article
Full-text available
Although literature suggests that resistance to TNF inhibitor (TNFi) therapy in patients with ulcerative colitis (UC) is partially linked to immune cell populations in the inflamed region, there is still substantial uncertainty underlying the relevant spatial context. Here, we used the highly multiplexed immunofluorescence imaging technology CODEX...
Article
In some supervised learning settings, the practitioner might have additional information on the features used for prediction. We propose a new method which leverages this additional information for better prediction. The method, which we call the feature-weighted elastic net ("fwelnet"), uses these "features of features" to adapt the relative penal...
Preprint
Full-text available
Mathematical models that accurately describe growth in human infants are lacking. We used the Michaelis-Menten equation, initially derived to relate substrate concentration to reaction rate, and subsequently modified and applied to nonhuman vertebrate growth, to model growth in humans from birth to 36 months. We compared the model results to actual...
Preprint
Full-text available
Background and Objectives: Standard pediatric growth curves cannot be used to impute missing height or weight measurements in individual children. The Michaelis-Menten equation, used for characterizing substrate-enzyme saturation curves, has been shown to model growth in many organisms including nonhuman vertebrates. We investigated this equation c...
Article
Unmeasured or latent variables are often the cause of correlations between multivariate measurements, which are studied in a variety of fields such as psychology, ecology, and medicine. For Gaussian measurements, there are classical tools such as factor analysis or principal component analysis, with a well-established theory and fast algorithms. Gen...
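For the Gaussian case mentioned above, a few lines of numpy show the classical tool at work: a single latent factor induces the cross-measurement correlations, and the top principal component recovers it (simulated illustration):

```python
import numpy as np

# One latent factor f drives all p observed Gaussian measurements;
# the leading principal component of the data recovers it closely.
rng = np.random.default_rng(9)
n, p = 1000, 5
f = rng.normal(size=n)                          # latent factor
X = np.outer(f, np.ones(p)) + 0.5 * rng.normal(size=(n, p))
X = X - X.mean(0)

_, _, Vt = np.linalg.svd(X, full_matrices=False)
scores = X @ Vt[0]                              # top-PC scores
corr = abs(np.corrcoef(scores, f)[0, 1])
print(round(corr, 2))                           # close to 1
```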
Article
In high-dimensional regression problems, often a relatively small subset of the features are relevant for predicting the outcome, and methods that impose sparsity on the solution are popular. When multiple correlated outcomes are available (multitask), reduced rank regression is an effective way to borrow strength and capture latent structures that...
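The classical reduced-rank idea can be sketched directly: fit OLS per outcome, then project the coefficient matrix onto the leading singular directions of the fitted values (toy data, identity error weighting assumed; not the paper's sparse variant):

```python
import numpy as np

# Classical reduced rank regression on simulated multitask data:
# the true coefficient matrix has rank 2, and projecting the OLS fit
# onto the top-2 directions of the fitted values recovers that structure.
rng = np.random.default_rng(4)
n, p, q, r = 100, 8, 6, 2
A = rng.normal(size=(p, r))
Bf = rng.normal(size=(r, q))
X = rng.normal(size=(n, p))
Y = X @ A @ Bf + 0.1 * rng.normal(size=(n, q))   # truly rank-2 signal

B_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)
_, _, Vt = np.linalg.svd(X @ B_ols, full_matrices=False)
P = Vt[:r].T @ Vt[:r]                 # projector onto top-r outcome directions
B_rrr = B_ols @ P
print(np.linalg.matrix_rank(B_rrr))   # → 2
```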
Preprint
Full-text available
The increasing availability and scale of Genome Wide Association Studies (GWAS) bring new horizons for understanding biological mechanisms. PathGPS is an exploratory method that discovers genetic architecture using GWAS summary data. It can separate genetic components from unobserved environmental factors and extract clusters of genes and traits as...
Article
Interpolators (estimators that achieve zero training error) have attracted growing attention in machine learning, mainly because state-of-the-art neural networks appear to be models of this type. In this paper, we study minimum ℓ2-norm ("ridgeless") interpolation least squares regression, focusing on the high-dimensional regime in which the number o...
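The min-norm interpolator itself is one line of numpy: the pseudoinverse solution, which also agrees with the small-penalty limit of ridge regression (illustrative sketch):

```python
import numpy as np

# Minimum-ell-2-norm ("ridgeless") interpolation in the p > n regime:
# the pseudoinverse solution fits the training data exactly and coincides
# with ridge regression as the penalty tends to zero.
rng = np.random.default_rng(3)
n, p = 20, 50                       # more features than observations
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

beta = np.linalg.pinv(X) @ y        # min-norm interpolator
print(np.allclose(X @ beta, y))     # → True: zero training error

lam = 1e-6                          # near the lambda -> 0+ ridge limit
ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print(np.allclose(beta, ridge, atol=1e-4))   # → True
```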
Article
Full-text available
We present a systematic assessment of polygenic risk score (PRS) prediction across more than 1,500 traits using genetic and phenotype data in the UK Biobank. We report 813 sparse PRS models with significant (p < 2.5 × 10⁻⁵) incremental predictive performance when compared against the covariate-only model that considers age, sex, types of genotyping...
Preprint
Full-text available
Reconstructing three dimensional (3D) chromatin structure from conformation capture assays (such as Hi-C) is a critical task in computational biology, since chromatin spatial architecture plays a vital role in numerous cellular processes and direct imaging is challenging. We previously introduced Poisson metric scaling (PoisMS), a technique that mo...
Article
Canonical correlation analysis (CCA) is a technique for measuring the association between two multivariate data matrices. A regularized modification of canonical correlation analysis (RCCA) which imposes an ℓ2 penalty on the CCA coefficients is widely used in applications with high-dimensional data. One limitation of such regulariz...
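A bare-bones version of ridge-regularized CCA can be written via an SVD of the whitened cross-covariance; this is a generic sketch of the standard ℓ2 regularization, not the modification the paper proposes:

```python
import numpy as np

# Ridge-regularized CCA sketch: add lam * I to each covariance before
# whitening, then read the leading canonical correlation off an SVD
# of the whitened cross-covariance.
def rcca_first_cor(X, Y, lam=0.1):
    X = X - X.mean(0)
    Y = Y - Y.mean(0)
    n = X.shape[0]
    Sxx = X.T @ X / n + lam * np.eye(X.shape[1])
    Syy = Y.T @ Y / n + lam * np.eye(Y.shape[1])
    Sxy = X.T @ Y / n

    def inv_sqrt(S):                      # symmetric inverse square root
        w, V = np.linalg.eigh(S)
        return V @ np.diag(w ** -0.5) @ V.T

    M = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)
    return np.linalg.svd(M, compute_uv=False)[0]

rng = np.random.default_rng(8)
z = rng.normal(size=(500, 1))             # shared latent signal
X = z + 0.3 * rng.normal(size=(500, 3))
Y = z + 0.3 * rng.normal(size=(500, 3))
rho = rcca_first_cor(X, Y)
print(round(rho, 2))                      # high: the two blocks share structure
```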
Preprint
Full-text available
Low-rank matrix approximation is one of the central concepts in machine learning, with applications in dimension reduction, de-noising, multivariate statistical methodology, and many more. A recent extension to LRMA is called low-rank matrix completion (LRMC). It solves the LRMA problem when some observations are missing and is especially useful fo...
Preprint
Full-text available
We present a systematic assessment of polygenic risk score (PRS) prediction across more than 1,500 traits using genetic and phenotype data in the UK Biobank. We report 813 sparse PRS models with significant (p < 2.5 × 10⁻⁵) incremental predictive performance when compared against the covariate-only model that considers age, sex, types of genotypi...
Preprint
Full-text available
Conditional density estimation is a fundamental problem in statistics, with scientific and practical applications in biology, economics, finance and environmental studies, to name a few. In this paper, we propose a conditional density estimator based on gradient boosting and Lindsey's method (LinCDE). LinCDE admits flexible modeling of the density...
Chapter
In the regression setting, the standard linear model Y = β₀ + β₁X₁ + ⋯ + βₚXₚ + ε is commonly used to describe the relationship between a response Y and a set of variables X₁, X₂, …, Xₚ.
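A toy least-squares fit of this model on simulated data (made-up coefficients, not an example from the book):

```python
import numpy as np

# Fit the standard linear model by least squares on simulated data.
rng = np.random.default_rng(6)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept, X1, X2
beta = np.array([1.0, 2.0, -3.0])                           # true β0, β1, β2
y = X @ beta + 0.1 * rng.normal(size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta_hat, 1))          # approximately [1, 2, -3]
```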
Chapter
In this chapter, we discuss the support vector machine (SVM), an approach for classification that was developed in the computer science community in the 1990s and that has grown in popularity since then.
Chapter
So far in this book, we have mostly focused on linear models. Linear models are relatively simple to describe and implement, and have advantages over other approaches in terms of interpretation and inference.
Chapter
The linear regression model discussed in Chap. 3 assumes that the response variable Y is quantitative. But in many situations, the response variable is instead qualitative.
Chapter
Resampling methods are an indispensable tool in modern statistics. They involve repeatedly drawing samples from a training set and refitting a model of interest on each sample in order to obtain additional information about the fitted model.
Chapter
In this chapter, we will consider the topics of survival analysis and censored data. These arise in the analysis of a unique kind of outcome variable: the time until an event occurs.
Chapter
Thus far, this textbook has mostly focused on estimation and its close cousin, prediction. In this chapter, we instead focus on hypothesis testing, which is key to conducting inference. We remind the reader that inference was briefly discussed in Chapter 2.
Chapter
Most of this book concerns supervised learning methods such as regression and classification. In the supervised learning setting, we typically have access to a set of p features X₁, X₂, …, Xₚ, measured on n observations, and a response Y also measured on those same n observations. The goal is then to predict Y using X₁, X₂, …, Xₚ.
Chapter
In this chapter, we describe tree-based methods for regression and classification. These involve stratifying or segmenting the predictor space into a number of simple regions. In order to make a prediction for a given observation, we typically use the mean or the mode response value for the training observations in the region to which it belongs.
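A one-split regression "stump" on simulated data illustrates the prediction rule: choose the split that minimizes within-region squared error and predict the mean response in each region (real trees apply this recursively):

```python
import numpy as np

# A single best split of one predictor: search candidate cut points,
# score each by total within-region squared error, predict region means.
rng = np.random.default_rng(7)
x = rng.uniform(0, 1, size=100)
y = np.where(x < 0.5, 1.0, 3.0) + 0.1 * rng.normal(size=100)

def sse(s):
    left, right = y[x < s], y[x >= s]
    if len(left) == 0 or len(right) == 0:
        return np.inf                       # disallow empty regions
    return float(((left - left.mean()) ** 2).sum()
                 + ((right - right.mean()) ** 2).sum())

best = min(np.linspace(0.1, 0.9, 17), key=sse)
left_mean, right_mean = y[x < best].mean(), y[x >= best].mean()
print(round(float(best), 2))                # lands near the true change point 0.5
```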
Chapter
In order to motivate our study of statistical learning, we begin with a simple example. Suppose that we are statistical consultants hired by a client to investigate the association between advertising and sales of a particular product.
Chapter
This chapter is about linear regression, a very simple approach for supervised learning. In particular, linear regression is a useful tool for predicting a quantitative response. It has been around for a long time and is the topic of innumerable textbooks. Though it may seem somewhat dull compared to some of the more modern statistical learning app...
Chapter
This chapter covers the important topic of deep learning. At the time of writing (2020), deep learning is a very active area of research in the machine learning and artificial intelligence communities.
Preprint
Full-text available
Using evidence derived from previously collected medical records to guide patient care has been a long standing vision of clinicians and informaticians, and one with the potential to transform medical practice. As a result of advances in technical infrastructure, statistical analysis methods, and the availability of patient data at scale, an implem...
Article
Motivation: Large-scale and high-dimensional genome sequencing data poses computational challenges. General purpose optimization tools are usually not optimal in terms of computational and memory performance for genetic data. Results: We develop two efficient solvers for optimization problems arising from large-scale regularized regressions on milli...
Article
Full-text available
Vital signs, including heart rate and body temperature, are useful in detecting or monitoring medical conditions, but are typically measured in the clinic and require follow-up laboratory testing for more definitive diagnoses. Here we examined whether vital signs as measured by consumer wearable devices (that is, continuously monitored heart rate,...
Article
We study the assessment of the accuracy of heterogeneous treatment effect (HTE) estimation, where the HTE is not directly observable so standard computation of prediction errors is not applicable. To tackle the difficulty, we propose an assessment approach by constructing pseudo‐observations of the HTE based on matching. Our contributions are three...
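One simple instance of the matching idea, on simulated data (a generic nearest-neighbor construction, not necessarily the paper's exact estimator): pair each treated unit with its closest control and use the outcome difference as a pseudo-observation of the unobservable HTE:

```python
import numpy as np

# Matching-based pseudo-observations for a heterogeneous treatment effect:
# each treated unit is paired with its nearest control in covariate space,
# and the outcome difference stands in for the unobserved effect.
rng = np.random.default_rng(5)
n = 200
x = rng.uniform(-1, 1, size=n)
w = rng.integers(0, 2, size=n)              # treatment indicator
tau = 1.0 + x                               # true HTE varies with x
y = x + w * tau + 0.1 * rng.normal(size=n)

treated, control = np.where(w == 1)[0], np.where(w == 0)[0]
pseudo = []
for i in treated:
    j = control[np.argmin(np.abs(x[control] - x[i]))]   # nearest control
    pseudo.append(y[i] - y[j])
ate_hat = float(np.mean(pseudo))
print(round(ate_hat, 2))                    # close to the average HTE of 1.0
```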
Preprint
Full-text available
We propose to use the difference in natural parameters (DINA) to quantify the heterogeneous treatment effect for the exponential family, a.k.a. the hazard ratio for the Cox model, in contrast to the difference in means. For responses such as binary outcome and survival time, DINA is of more practical interest and convenient for modeling the covaria...
Preprint
The lasso and elastic net are popular regularized regression models for supervised learning. Friedman, Hastie, and Tibshirani (2010) introduced a computationally efficient algorithm for computing the elastic net regularization path for ordinary least squares regression, logistic regression and multinomial logistic regression, while Simon, Friedman,...
Preprint
Full-text available
We develop two efficient solvers for optimization problems arising from large-scale regularized regressions on millions of genetic variants sequenced from hundreds of thousands of individuals. These genetic variants are encoded by the values in the set {0, 1, 2, NA}. We take advantage of this fact and use two bits to represent each entry in a genet...
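The two-bit encoding is easy to prototype in plain Python (the particular code assignment below, including which bit pattern denotes NA, is an assumption for illustration):

```python
# Pack genotypes in {0, 1, 2, NA} into 2 bits each: four per byte.
CODES = {0: 0b00, 1: 0b01, 2: 0b10, None: 0b11}   # assumed code assignment
DECODE = {v: k for k, v in CODES.items()}

def pack(genotypes):
    out = bytearray()
    for i in range(0, len(genotypes), 4):
        b = 0
        for j, g in enumerate(genotypes[i:i + 4]):
            b |= CODES[g] << (2 * j)
        out.append(b)
    return bytes(out)

def unpack(data, n):
    return [DECODE[(data[i // 4] >> (2 * (i % 4))) & 0b11] for i in range(n)]

g = [0, 2, 1, None, 2, 0]
print(unpack(pack(g), len(g)) == g)   # → True: lossless round trip in 2 bytes
```

Storing six genotypes in two bytes instead of 48 (as float64) is the kind of 16-fold memory saving that makes biobank-scale regression tractable.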
Article
Motivation: The prediction performance of the Cox proportional hazard model suffers when there are only a few uncensored events in the training data. Results: We propose a Sparse-Group regularized Cox regression method to improve the prediction performance of large-scale and high-dimensional survival data with few observed events. Our approach is appl...
Article
Polygenic risk models have led to significant advances in understanding complex diseases and their clinical presentation. While polygenic risk scores (PRS) can effectively predict outcomes, they do not generally account for disease subtypes or pathways which underlie within-trait diversity. Here, we introduce a latent factor model of genetic risk b...
Article
Full-text available
Clinical laboratory tests are a critical component of the continuum of care. We evaluate the genetic basis of 35 blood and urine laboratory measurements in the UK Biobank (n = 363,228 individuals). We identify 1,857 loci associated with at least one trait, containing 3,374 fine-mapped associations and additional sets of large-effect (>0.1 s.d.) pro...
Article
An Introduction to Statistical Learning provides an accessible overview of the field of statistical learning, an essential toolset for making sense of the vast and complex data sets that have emerged in fields ranging from biology to finance to marketing to astrophysics in the past twenty years. This book presents some of the most important modelin...
Article
Professor Efron has presented us with a thought‐provoking paper on the relationship between prediction, estimation, and attribution in the modern era of data science. While we appreciate many of his arguments, we see more of a continuum between the old and new methodology, and the opportunity for both to improve through their synergy.
Preprint
Full-text available
Canonical Correlation Analysis (CCA) is a technique for measuring the association between two multivariate sets of variables. The regularized modification of canonical correlation analysis (RCCA), which imposes an ℓ2 penalty on the CCA coefficients, is widely used in the analysis of high-dimensional data. One limitation of...
Article
Full-text available
The UK Biobank is a very large, prospective population-based cohort study across the United Kingdom. It provides unprecedented opportunities for researchers to investigate the relationship between genotypic information and phenotypes of interest. Multiple regression methods, compared with genome-wide association studies (GWAS), have already been sh...
Preprint
Full-text available
Unmeasured or latent variables are often the cause of correlations between multivariate measurements and are studied in a variety of fields such as psychology, ecology, and medicine. For Gaussian measurements, there are classical tools such as factor analysis or principal component analysis with a well-established theory and fast algorithms. Genera...
Article
Full-text available
The dense network of interconnected cellular signalling responses that are quantifiable in peripheral immune cells provides a wealth of actionable immunological insights. Although high-throughput single-cell profiling techniques, including polychromatic flow and mass cytometry, have matured to a point that enables detailed immune profiling of patie...
Article
We develop a scalable and highly efficient algorithm to fit a Cox proportional hazard model by maximizing the L¹-regularized (Lasso) partial likelihood function, based on the Batch Screening Iterative Lasso (BASIL) method developed in Qian and others (2019). Our algorithm is particularly suitable for large-scale and high-dimensional data that do...
Preprint
Full-text available
We propose a Sparse-Group regularized Cox regression method to analyze large-scale, ultrahigh-dimensional, and multi-response survival data efficiently. Our method has three key components: 1. A Sparse-Group penalty that encourages the coefficients to have small and overlapping support; 2. A variable screening procedure that minimizes the frequency...
Preprint
Full-text available
Three dimensional (3D) genome spatial organization is critical for numerous cellular processes, including transcription, while certain conformation-driven structural alterations are frequently oncogenic. Genome architecture had been notoriously difficult to elucidate, but the advent of the suite of chromatin conformation capture assays, notably Hi-...
Preprint
Full-text available
In some supervised learning settings, the practitioner might have additional information on the features used for prediction. We propose a new method which leverages this additional information for better prediction. The method, which we call the feature-weighted elastic net ("fwelnet"), uses these "features of features" to adapt the relative penal...
Preprint
Full-text available
In high-dimensional regression problems, often a relatively small subset of the features are relevant for predicting the outcome, and methods that impose sparsity on the solution are popular. When multiple correlated outcomes are available (multitask), reduced rank regression is an effective way to borrow strength and capture latent structures that...
Preprint
Full-text available
We study the assessment of the accuracy of heterogeneous treatment effect (HTE) estimation, where the HTE is not directly observable so standard computation of prediction errors is not applicable. To tackle the difficulty, we propose an assessment approach by constructing pseudo-observations of the HTE based on matching. Our contributions are three...
Preprint
Full-text available
We develop a scalable and highly efficient algorithm to fit a Cox proportional hazard model by maximizing the L¹-regularized (Lasso) partial likelihood function, based on the Batch Screening Iterative Lasso (BASIL) method developed in Qian et al. (2019). The output of our algorithm is the full Lasso path, the parameter estimates at all predefined...
Article
Full-text available
In the US, the normal oral temperature of adults is, on average, lower than the canonical 37°C established in the 19th century. We postulated that body temperature has decreased over time. Using measurements from three cohorts: the Union Army Veterans of the Civil War (N = 23,710; measurement years 1860–1940), the National Health and Nutrition Exa...
Preprint
While many diseases of aging have been linked to the immunological system, immune metrics with which to identify the most at-risk individuals are lacking. Here, we studied the blood immunome of 1,001 individuals aged 8–96 and derived an inflammatory clock of aging (iAge), which tracked with multimorbidity and immunosenescence. In centenarians, iAge...
Preprint
Full-text available
Polygenic risk models have led to significant advances in understanding complex diseases and their clinical presentation. While traditional models of genetic risk like polygenic risk scores (PRS) can effectively predict outcomes, they do not generally account for disease subtypes or pathways which underlie within-trait diversity. Here, we introduce...