# Trevor Hastie, Ph.D.

Stanford University · Department of Statistics

## About

- Publications: 488
- Reads: 211,993 (a 'read' is counted each time someone views a publication summary, such as the title, abstract, and list of authors, clicks on a figure, or views or downloads the full text)
- Citations: 254,289

## Publications

Publications (488)

Importance:
Although oral temperature is commonly assessed in medical examinations, the range of usual or "normal" temperature is poorly defined.
Objective:
To determine normal oral temperature ranges by age, sex, height, weight, and time of day.
Design, setting, and participants:
This cross-sectional study used clinical visit information from...

Mental illnesses are a leading cause of disability globally. Across 17 psychiatric disorders, functional disability is often in part caused by cognitive impairments. However, cognitive heterogeneity in mental health is poorly understood, particularly in children.
We used generalized additive models (GAMs) to reconcile discrepant reports of cognitiv...

Cross-validation is a widely-used technique to estimate prediction error, but its behavior is complex and not fully understood. Ideally, one would like to think that cross-validation estimates the prediction error for the model at hand, fit to the training data. We prove that this is not the case for the linear model fit by ordinary least squares;...
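The K-fold procedure that the abstract studies can be sketched with synthetic data (all numbers below are illustrative, not from the paper):

```python
# Minimal K-fold cross-validation sketch for OLS on a linear model
# (hypothetical data; not the paper's experiment).
import numpy as np

rng = np.random.default_rng(0)
n, p, K = 100, 5, 5
X = rng.standard_normal((n, p))
beta = np.arange(1, p + 1, dtype=float)
y = X @ beta + rng.standard_normal(n)

folds = np.array_split(rng.permutation(n), K)  # disjoint validation folds
fold_errors = []
for holdout in folds:
    train = np.setdiff1d(np.arange(n), holdout)
    # OLS fit on the training fold only
    coef, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
    resid = y[holdout] - X[holdout] @ coef
    fold_errors.append(np.mean(resid ** 2))

cv_error = float(np.mean(fold_errors))  # K-fold estimate of prediction error
```

Since the noise variance here is 1, the CV estimate lands near 1; the paper's question is precisely what population quantity such an estimate tracks.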

The lasso and elastic net are popular regularized regression models for supervised learning. Friedman, Hastie, and Tibshirani (2010) introduced a computationally efficient algorithm for computing the elastic net regularization path for ordinary least squares regression, logistic regression and multinomial logistic regression, while Simon, Friedman,...
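The core of the coordinate-descent algorithm of Friedman, Hastie, and Tibshirani (2010) can be sketched for the plain lasso case; this is a simplified illustration assuming standardized, centered inputs, not the full glmnet implementation:

```python
import numpy as np

def soft_threshold(z, gamma):
    """Soft-thresholding operator: S(z, gamma) = sign(z) * max(|z| - gamma, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Cyclic coordinate descent for the lasso (sketch).
    Assumes columns of X have mean 0 and variance 1, and y has mean 0.
    Minimizes (1/2n)||y - X b||^2 + lam * ||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    r = y - X @ b  # running residual
    for _ in range(n_iter):
        for j in range(p):
            r = r + X[:, j] * b[j]           # partial residual excluding j
            z = X[:, j] @ r / n              # univariate least-squares coefficient
            b[j] = soft_threshold(z, lam)    # shrink and possibly zero out
            r = r - X[:, j] * b[j]
    return b
```

On standardized data each coordinate update is a univariate least-squares fit followed by soft-thresholding, which is what makes the pathwise algorithm fast.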

Although literature suggests that resistance to TNF inhibitor (TNFi) therapy in patients with ulcerative colitis (UC) is partially linked to immune cell populations in the inflamed region, there is still substantial uncertainty underlying the relevant spatial context. Here, we used the highly multiplexed immunofluorescence imaging technology CODEX...

In some supervised learning settings, the practitioner might have additional information on the features used for prediction. We propose a new method which leverages this additional information for better prediction. The method, which we call the feature-weighted elastic net ("fwelnet"), uses these "features of features" to adapt the relative penal...

Mathematical models that accurately describe growth in human infants are lacking. We used the Michaelis-Menten equation, initially derived to relate substrate concentration to reaction rate, and subsequently modified and applied to nonhuman vertebrate growth, to model growth in humans from birth to 36 months. We compared the model results to actual...

Background and Objectives: Standard pediatric growth curves cannot be used to impute missing height or weight measurements in individual children. The Michaelis-Menten equation, used for characterizing substrate-enzyme saturation curves, has been shown to model growth in many organisms including nonhuman vertebrates. We investigated this equation c...
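The Michaelis-Menten form used in these growth studies relates size to age through a saturating curve; a minimal sketch with hypothetical parameters (not the papers' fitted values):

```python
def mm_growth(t, y0, A, B):
    """Modified Michaelis-Menten curve: rapid early growth that saturates.
    y0: size at t = 0; A: asymptotic gain; B: age at which half of A is gained.
    (Hypothetical parameterization for illustration only.)"""
    return y0 + A * t / (B + t)

# Example: a hypothetical length curve with birth length 50 cm,
# asymptotic gain 60 cm, and half of that gain reached at 12 months.
lengths = [mm_growth(t, 50.0, 60.0, 12.0) for t in (0, 12, 36)]
```

Because the curve is monotone and saturating, a fitted version can interpolate a missing measurement between two observed visits, which is the use case the abstract describes.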

Unmeasured or latent variables are often the cause of correlations between multivariate measurements, which are studied in a variety of fields such as psychology, ecology, and medicine. For Gaussian measurements, there are classical tools such as factor analysis or principal component analysis with a well-established theory and fast algorithms. Gen...

In high-dimensional regression problems, often a relatively small subset of the features are relevant for predicting the outcome, and methods that impose sparsity on the solution are popular. When multiple correlated outcomes are available (multitask), reduced rank regression is an effective way to borrow strength and capture latent structures that...

The increasing availability and scale of Genome Wide Association Studies (GWAS) bring new horizons for understanding biological mechanisms. PathGPS is an exploratory method that discovers genetic architecture using GWAS summary data. It can separate genetic components from unobserved environmental factors and extract clusters of genes and traits as...

Interpolators, estimators that achieve zero training error, have attracted growing attention in machine learning, mainly because state-of-the-art neural networks appear to be models of this type. In this paper, we study minimum ℓ2-norm ("ridgeless") interpolation least squares regression, focusing on the high-dimensional regime in which the number o...
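The "ridgeless" interpolator itself is easy to demonstrate: in the overparameterized regime, the minimum ℓ2-norm least squares solution fits the training data exactly. A small numpy sketch with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 20, 50          # overparameterized: more features than observations
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# Minimum ℓ2-norm interpolator b = pinv(X) @ y, the limit of the ridge
# estimator as the penalty tends to 0 ("ridgeless" least squares).
b = np.linalg.pinv(X) @ y

# Zero training error: the fit interpolates every observation.
assert np.allclose(X @ b, y)
```

The interesting question, which the paper analyzes, is when this exact fit nonetheless generalizes well out of sample.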

We present a systematic assessment of polygenic risk score (PRS) prediction across more than 1,500 traits using genetic and phenotype data in the UK Biobank. We report 813 sparse PRS models with significant (p < 2.5 × 10⁻⁵) incremental predictive performance when compared against the covariate-only model that considers age, sex, types of genotypi...

Reconstructing three dimensional (3D) chromatin structure from conformation capture assays (such as Hi-C) is a critical task in computational biology, since chromatin spatial architecture plays a vital role in numerous cellular processes and direct imaging is challenging. We previously introduced Poisson metric scaling (PoisMS), a technique that mo...

Canonical correlation analysis (CCA) is a technique for measuring the association between two multivariate data matrices. A regularized modification of canonical correlation analysis (RCCA) which imposes an ℓ2 penalty on the CCA coefficients is widely used in applications with high-dimensional data. One limitation of such regulariz...

Low-rank matrix approximation is one of the central concepts in machine learning, with applications in dimension reduction, de-noising, multivariate statistical methodology, and many more. A recent extension to LRMA is called low-rank matrix completion (LRMC). It solves the LRMA problem when some observations are missing and is especially useful fo...


Conditional density estimation is a fundamental problem in statistics, with scientific and practical applications in biology, economics, finance and environmental studies, to name a few. In this paper, we propose a conditional density estimator based on gradient boosting and Lindsey's method (LinCDE). LinCDE admits flexible modeling of the density...

In the regression setting, the standard linear model Y=β0+β1X1+⋯+βpXp+ϵ is commonly used to describe the relationship between a response Y and a set of variables X1, X2,…,Xp.
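Fitting this model by least squares is a one-liner once an intercept column is added to the design matrix; a small numpy sketch with synthetic data (coefficients chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
X = rng.standard_normal((n, 2))
# True model: Y = 1 + 2*X1 - 3*X2 + eps (illustrative coefficients)
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + 0.1 * rng.standard_normal(n)

# Prepend an intercept column and solve the least-squares problem.
design = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
b0, b1, b2 = coef   # estimates of beta0, beta1, beta2
```

With modest noise and n = 500, the estimates recover the generating coefficients closely.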

In this chapter, we discuss the support vector machine (SVM), an approach for classification that was developed in the computer science community in the 1990s and that has grown in popularity since then.

So far in this book, we have mostly focused on linear models. Linear models are relatively simple to describe and implement, and have advantages over other approaches in terms of interpretation and inference.

The linear regression model discussed in Chap. 3 assumes that the response variable Y is quantitative. But in many situations, the response variable is instead qualitative.

Resampling methods are an indispensable tool in modern statistics. They involve repeatedly drawing samples from a training set and refitting a model of interest on each sample in order to obtain additional information about the fitted model.
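The bootstrap, the canonical resampling method, illustrates the idea; a sketch using only the standard library, with made-up data:

```python
import random
import statistics

random.seed(0)
data = [2.1, 3.4, 2.8, 5.0, 4.2, 3.9, 2.5, 4.8, 3.1, 3.6]  # toy sample

# Bootstrap: repeatedly draw samples of the same size with replacement
# and recompute the statistic of interest (here, the mean) on each one
# to estimate its sampling variability.
boot_means = []
for _ in range(2000):
    resample = random.choices(data, k=len(data))
    boot_means.append(statistics.mean(resample))

se_hat = statistics.stdev(boot_means)  # bootstrap standard error of the mean
```

For the mean, the bootstrap standard error can be checked against the textbook formula s/√n; for statistics without closed-form standard errors, the resampling estimate is often the only practical option.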

In this chapter, we will consider the topics of survival analysis and censored data. These arise in the analysis of a unique kind of outcome variable: the time until an event occurs.

Thus far, this textbook has mostly focused on estimation and its close cousin, prediction. In this chapter, we instead focus on hypothesis testing, which is key to conducting inference. We remind the reader that inference was briefly discussed in Chapter 2.

Most of this book concerns supervised learning methods such as regression and classification. In the supervised learning setting, we typically have access to a set of p features X1,X2,…,Xp, measured on n observations, and a response Y also measured on those same n observations. The goal is then to predict Y using X1,X2,…,Xp.

In this chapter, we describe tree-based methods for regression and classification. These involve stratifying or segmenting the predictor space into a number of simple regions. In order to make a prediction for a given observation, we typically use the mean or the mode response value for the training observations in the region to which it belongs.
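A one-split regression tree (a "stump") shows the stratify-then-average idea in miniature; this is a toy sketch, not a full CART implementation:

```python
# Fit a single-split regression tree: choose the split point that most
# reduces squared error, then predict with the mean response in each region.
def fit_stump(x, y):
    best = None
    for s in sorted(set(x))[1:]:                     # candidate split points
        left = [yi for xi, yi in zip(x, y) if xi < s]
        right = [yi for xi, yi in zip(x, y) if xi >= s]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        rss = (sum((yi - ml) ** 2 for yi in left)
               + sum((yi - mr) ** 2 for yi in right))
        if best is None or rss < best[0]:
            best = (rss, s, ml, mr)
    _, split, mean_left, mean_right = best
    return split, mean_left, mean_right

def predict_stump(stump, xi):
    split, mean_left, mean_right = stump
    return mean_left if xi < split else mean_right

# Toy data with an obvious jump between x < 10 and x >= 10.
x = [1, 2, 3, 10, 11, 12]
y = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
stump = fit_stump(x, y)
```

Full tree-growing applies the same split search recursively within each region; boosting and random forests then combine many such trees.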

In order to motivate our study of statistical learning, we begin with a simple example. Suppose that we are statistical consultants hired by a client to investigate the association between advertising and sales of a particular product.

This chapter is about linear regression, a very simple approach for supervised learning. In particular, linear regression is a useful tool for predicting a quantitative response. It has been around for a long time and is the topic of innumerable textbooks. Though it may seem somewhat dull compared to some of the more modern statistical learning app...

This chapter covers the important topic of deep learning. At the time of writing (2020), deep learning is a very active area of research in the machine learning and artificial intelligence communities.

Using evidence derived from previously collected medical records to guide patient care has been a long standing vision of clinicians and informaticians, and one with the potential to transform medical practice. As a result of advances in technical infrastructure, statistical analysis methods, and the availability of patient data at scale, an implem...

Motivation
Large-scale and high-dimensional genome sequencing data poses computational challenges. General purpose optimization tools are usually not optimal in terms of computational and memory performance for genetic data.
Results
We develop two efficient solvers for optimization problems arising from large-scale regularized regressions on milli...

Vital signs, including heart rate and body temperature, are useful in detecting or monitoring medical conditions, but are typically measured in the clinic and require follow-up laboratory testing for more definitive diagnoses. Here we examined whether vital signs as measured by consumer wearable devices (that is, continuously monitored heart rate,...

We study the assessment of the accuracy of heterogeneous treatment effect (HTE) estimation, where the HTE is not directly observable so standard computation of prediction errors is not applicable. To tackle the difficulty, we propose an assessment approach by constructing pseudo‐observations of the HTE based on matching. Our contributions are three...

We propose to use the difference in natural parameters (DINA) to quantify the heterogeneous treatment effect for the exponential family, a.k.a. the hazard ratio for the Cox model, in contrast to the difference in means. For responses such as binary outcome and survival time, DINA is of more practical interest and convenient for modeling the covaria...


We develop two efficient solvers for optimization problems arising from large-scale regularized regressions on millions of genetic variants sequenced from hundreds of thousands of individuals. These genetic variants are encoded by the values in the set {0, 1, 2, NA}. We take advantage of this fact and use two bits to represent each entry in a genet...
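The two-bit packing idea can be sketched in a few lines; this is an illustrative encoding of the general scheme described above, not necessarily the solvers' actual in-memory layout:

```python
# Each genotype in {0, 1, 2, NA} maps to one of four 2-bit codes,
# so four genotypes fit in a single byte. None stands for NA.
CODE = {0: 0b00, 1: 0b01, 2: 0b10, None: 0b11}
DECODE = {v: k for k, v in CODE.items()}

def pack(genotypes):
    """Pack a list of genotypes into bytes, 4 genotypes per byte."""
    out = bytearray((len(genotypes) + 3) // 4)
    for i, g in enumerate(genotypes):
        out[i // 4] |= CODE[g] << (2 * (i % 4))
    return bytes(out)

def unpack(packed, n):
    """Recover the first n genotypes from the packed bytes."""
    return [DECODE[(packed[i // 4] >> (2 * (i % 4))) & 0b11] for i in range(n)]

g = [0, 2, 1, None, 2, 0]
packed = pack(g)              # 6 genotypes stored in 2 bytes
assert unpack(packed, len(g)) == g
```

Packing four genotypes per byte is what makes it feasible to hold genotype matrices with millions of variants in memory.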

Motivation:
The prediction performance of Cox proportional hazard model suffers when there are only few uncensored events in the training data.
Results:
We propose a Sparse-Group regularized Cox regression method to improve the prediction performance of large-scale and high-dimensional survival data with few observed events. Our approach is appl...

Polygenic risk models have led to significant advances in understanding complex diseases and their clinical presentation. While polygenic risk scores (PRS) can effectively predict outcomes, they do not generally account for disease subtypes or pathways which underlie within-trait diversity. Here, we introduce a latent factor model of genetic risk b...

Clinical laboratory tests are a critical component of the continuum of care. We evaluate the genetic basis of 35 blood and urine laboratory measurements in the UK Biobank (n = 363,228 individuals). We identify 1,857 loci associated with at least one trait, containing 3,374 fine-mapped associations and additional sets of large-effect (>0.1 s.d.) pro...

An Introduction to Statistical Learning provides an accessible overview of the field of statistical learning, an essential toolset for making sense of the vast and complex data sets that have emerged in fields ranging from biology to finance to marketing to astrophysics in the past twenty years. This book presents some of the most important modelin...

Professor Efron has presented us with a thought‐provoking paper on the relationship between prediction, estimation, and attribution in the modern era of data science. While we appreciate many of his arguments, we see more of a continuum between the old and new methodology, and the opportunity for both to improve through their synergy.

Canonical Correlation Analysis (CCA) is a technique for measuring the association between two multivariate sets of variables. The Regularized modification of Canonical Correlation Analysis (RCCA) imposing $\ell_2$ penalty on the CCA coefficients is widely used in applications while conducting the analysis of high dimensional data. One limitation of...

The UK Biobank is a very large, prospective population-based cohort study across the United Kingdom. It provides unprecedented opportunities for researchers to investigate the relationship between genotypic information and phenotypes of interest. Multiple regression methods, compared with genome-wide association studies (GWAS), have already been sh...

Unmeasured or latent variables are often the cause of correlations between multivariate measurements and are studied in a variety of fields such as psychology, ecology, and medicine. For Gaussian measurements, there are classical tools such as factor analysis or principal component analysis with a well-established theory and fast algorithms. Genera...

The dense network of interconnected cellular signalling responses that are quantifiable in peripheral immune cells provides a wealth of actionable immunological insights. Although high-throughput single-cell profiling techniques, including polychromatic flow and mass cytometry, have matured to a point that enables detailed immune profiling of patie...

We develop a scalable and highly efficient algorithm to fit a Cox proportional hazard model by maximizing the $L^1$-regularized (Lasso) partial likelihood function, based on the Batch Screening Iterative Lasso (BASIL) method developed in Qian and others (2019). Our algorithm is particularly suitable for large-scale and high-dimensional data that do...

We propose a Sparse-Group regularized Cox regression method to analyze large-scale, ultrahigh-dimensional, and multi-response survival data efficiently. Our method has three key components: 1. A Sparse-Group penalty that encourages the coefficients to have small and overlapping support; 2. A variable screening procedure that minimizes the frequency...

Three dimensional (3D) genome spatial organization is critical for numerous cellular processes, including transcription, while certain conformation-driven structural alterations are frequently oncogenic. Genome architecture had been notoriously difficult to elucidate, but the advent of the suite of chromatin conformation capture assays, notably Hi-...





We develop a scalable and highly efficient algorithm to fit a Cox proportional hazard model by maximizing the L¹-regularized (Lasso) partial likelihood function, based on the Batch Screening Iterative Lasso (BASIL) method developed in Qian et al. (2019). The output of our algorithm is the full Lasso path, the parameter estimates at all predefined...

In the US, the normal oral temperature of adults is, on average, lower than the canonical 37°C established in the 19th century. We postulated that body temperature has decreased over time. Using measurements from three cohorts: the Union Army Veterans of the Civil War (N = 23,710; measurement years 1860–1940), the National Health and Nutrition Exa...

While many diseases of aging have been linked to the immunological system, immune metrics with which to identify the most at-risk individuals are lacking. Here, we studied the blood immunome of 1001 individuals age 8-96 and derived an inflammatory clock of aging (iAge), which tracked with multi-morbidity and immunosenescence. In centenarians, iAge...

Polygenic risk models have led to significant advances in understanding complex diseases and their clinical presentation. While traditional models of genetic risk like polygenic risk scores (PRS) can effectively predict outcomes, they do not generally account for disease subtypes or pathways which underlie within-trait diversity. Here, we introduce...

PURPOSE
The preoperative distinction between uterine leiomyoma (LM) and leiomyosarcoma (LMS) is difficult, which may result in dissemination of an unexpected malignancy during surgery for a presumed benign lesion. An assay based on circulating tumor DNA (ctDNA) could help in the preoperative distinction between LM and LMS. This study addresses the...