Emmanuel J. Candès’s research while affiliated with Stanford University and other places


Publications (265)


Characterizing the Training-Conditional Coverage of Full Conformal Inference in High Dimensions
  • Preprint

February 2025 · 4 Reads

Isaac Gibbs · Emmanuel J. Candès

We study the coverage properties of full conformal regression in the proportional asymptotic regime where the ratio of the dimension and the sample size converges to a constant. In this setting, existing theory tells us only that full conformal inference is unbiased, in the sense that its average coverage lies at the desired level when marginalized over both the new test point and the training data. Considerably less is known about the behaviour of these methods conditional on the training set. As a result, the exact benefits of full conformal inference over much simpler alternative methods are unclear. This paper investigates the behaviour of full conformal inference and natural uncorrected alternatives for a broad class of L_2-regularized linear regression models. We show that in the proportional asymptotic regime the training-conditional coverage of full conformal inference concentrates at the target value. On the other hand, simple alternatives that directly compare test and training residuals realize a constant undercoverage bias. While these results demonstrate the necessity of full conformal inference in correcting for high-dimensional overfitting, we also show that this same methodology is redundant for the related task of tuning the regularization level. In particular, we show that full conformal inference still yields asymptotically valid coverage when the regularization level is selected using only the training set, without consideration of the test point. Simulations show that our asymptotic approximations are accurate in finite samples and can be readily extended to other popular full conformal variants, such as full conformal quantile regression and the LASSO, that do not directly meet our assumptions.
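
As a concrete reference point for the procedure analyzed above, here is a minimal numpy sketch of textbook full conformal inference for L_2-regularized (ridge) regression. It illustrates the general recipe only, not the paper's code; the function name, label grid, and defaults are arbitrary choices.

```python
import numpy as np

def full_conformal_ridge(X, y, x_new, lam, alpha=0.1, y_grid=None):
    """Full conformal prediction set for ridge regression.

    For each candidate label y0, refit ridge on the augmented data
    (X, y) plus (x_new, y0), and keep y0 whenever the test residual
    is not among the largest alpha fraction of all n + 1 residuals.
    """
    n, d = X.shape
    if y_grid is None:  # crude grid; a careful implementation would be smarter
        spread = y.max() - y.min()
        y_grid = np.linspace(y.min() - spread, y.max() + spread, 200)

    Xa = np.vstack([X, x_new])
    accepted = []
    for y0 in y_grid:
        ya = np.append(y, y0)
        beta = np.linalg.solve(Xa.T @ Xa + lam * np.eye(d), Xa.T @ ya)
        resid = np.abs(ya - Xa @ beta)
        # Conformal p-value: rank of the test residual among all n + 1
        pval = np.mean(resid >= resid[-1])
        if pval > alpha:
            accepted.append(y0)
    return np.array(accepted)
```

The "simple alternatives" in the abstract skip the refit and compare the test residual against residuals of a model fit on the training set alone; the paper shows that in high dimensions this shortcut undercovers, while the full refit above does not.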


Figure 3: Characterization of POPPER. (1) POPPER designs biologically relevant falsification experiments. (2) It performs multiple logical steps to execute the experiment. (3) It employs a wide range of statistical tests. (4) Progression of cumulative e-values across multiple iterations of falsification tests. More details are available in Appendix E.
Figure 5: Failure mode distribution for POPPER, labeled automatically by O1 and manually checked by humans.
Table: Evaluation of various LLM backbones with POPPER.
Automated Hypothesis Validation with Agentic Sequential Falsifications
  • Preprint
  • File available

February 2025 · 31 Reads

Kexin Huang · Ying Jin · Ryan Li · [...]

Hypotheses are central to information acquisition, decision-making, and discovery. However, many real-world hypotheses are abstract, high-level statements that are difficult to validate directly. This challenge is further intensified by the rise of hypothesis generation from Large Language Models (LLMs), which are prone to hallucination and produce hypotheses in volumes that make manual validation impractical. Here we propose Popper, an agentic framework for rigorous automated validation of free-form hypotheses. Guided by Karl Popper's principle of falsification, Popper validates a hypothesis using LLM agents that design and execute falsification experiments targeting its measurable implications. A novel sequential testing framework ensures strict Type-I error control while actively gathering evidence from diverse observations, whether drawn from existing data or newly conducted procedures. We demonstrate Popper on six domains including biology, economics, and sociology. Popper delivers robust error control, high power, and scalability. Furthermore, compared to human scientists, Popper achieved comparable performance in validating complex biological hypotheses while reducing the time required tenfold, providing a scalable, rigorous solution for hypothesis validation.
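
The sequential Type-I error control Popper relies on can be conveyed with generic e-value machinery. The sketch below is the standard calibrate-and-multiply construction together with Ville's inequality, not Popper's agent pipeline; the calibrator choice and stopping rule are illustrative.

```python
import numpy as np

def sequential_falsification(p_values, alpha=0.05, kappa=0.5):
    """Anytime-valid sequential test from a stream of independent
    falsification-test p-values.

    Each p-value is converted to an e-value with the calibrator
    e(p) = kappa * p**(kappa - 1), and the e-values are multiplied.
    Under the null the running product is a supermartingale, so by
    Ville's inequality it exceeds 1/alpha with probability at most
    alpha: stopping at that boundary controls the Type-I error.
    """
    e_process = 1.0
    for t, p in enumerate(p_values, start=1):
        e_process *= kappa * p ** (kappa - 1)
        if e_process >= 1.0 / alpha:
            return t, e_process  # evidence threshold crossed at round t
    return None, e_process  # evidence never strong enough to conclude

# Example: mounting evidence against the null from three experiments
print(sequential_falsification([0.04, 0.01, 0.02], alpha=0.05))
```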


Overparameterized ReLU Neural Networks Learn the Simplest Model: Neural Isometry and Phase Transitions

January 2025 · 6 Reads · 1 Citation

IEEE Transactions on Information Theory

The practice of deep learning has shown that neural networks generalize remarkably well even with an extreme number of learned parameters. This appears to contradict traditional statistical wisdom, in which a trade-off between model complexity and fit to the data is essential. We aim to address this discrepancy by adopting a convex optimization and sparse recovery perspective. We consider the training and generalization properties of two-layer ReLU networks with standard weight decay regularization. Under certain regularity assumptions on the data, we show that ReLU networks with an arbitrary number of parameters learn only simple models that explain the data. This is analogous to the recovery of the sparsest linear model in compressed sensing. For ReLU networks and their variants with skip connections or normalization layers, we present isometry conditions that ensure the exact recovery of planted neurons. For randomly generated data, we show the existence of a phase transition in recovering planted neural network models, which is easy to describe: whenever the ratio between the number of samples and the dimension exceeds a numerical threshold, the recovery succeeds with high probability; otherwise, it fails with high probability. Surprisingly, ReLU networks learn simple and sparse models that generalize well even when the labels are noisy. The phase transition phenomenon is confirmed through numerical experiments.



Second-order group knockoffs with applications to GWAS

September 2024 · 9 Reads · 2 Citations

Bioinformatics

Motivation: Conditional testing via the knockoff framework allows one to identify—among a large number of possible explanatory variables—those that carry unique information about an outcome of interest, and also provides a false discovery rate guarantee on the selection. This approach is particularly well suited to the analysis of genome-wide association studies (GWAS), which have the goal of identifying genetic variants that influence traits of medical relevance.

Results: While conditional testing can be both more powerful and precise than traditional GWAS analysis methods, its vanilla implementation encounters a difficulty common to all multivariate analysis methods: it is challenging to distinguish among multiple, highly correlated regressors. This impasse can be overcome by shifting the object of inference from single variables to groups of correlated variables. To achieve this, it is necessary to construct "group knockoffs." While successful examples are already documented in the literature, this paper substantially expands the set of algorithms and software for group knockoffs. We focus in particular on second-order knockoffs, for which we describe correlation matrix approximations that are appropriate for GWAS data and that result in considerable computational savings. We illustrate the effectiveness of the proposed methods with simulations and with the analysis of albuminuria data from the UK Biobank.

Availability: The described algorithms are implemented in the open-source Julia package Knockoffs.jl. R and Python wrappers are available as the knockoffsr and knockoffspy packages.

Supplementary information: Supplementary data are available from Bioinformatics online.
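
For intuition about what a "second-order knockoff" is, here is the classical equicorrelated Gaussian construction for single variables in numpy. The paper's contribution lies in the group versions and in correlation-matrix approximations suited to GWAS scale, implemented in Knockoffs.jl; none of that is reflected in this simple sketch.

```python
import numpy as np

def gaussian_knockoffs(X, mu, Sigma, rng=None):
    """Second-order (Gaussian model-X) knockoffs, equicorrelated construction.

    With D = diag(s), knockoffs are sampled as
        X_knock | X ~ N(X - (X - mu) Sigma^{-1} D, 2D - D Sigma^{-1} D),
    where s is chosen so that 0 <= D <= 2 * Sigma.
    """
    rng = np.random.default_rng(rng)
    n, p = X.shape
    # Choose s on the correlation scale: s_j = min(2 * lambda_min(R), 1).
    sd = np.sqrt(np.diag(Sigma))
    R = Sigma / np.outer(sd, sd)
    s = min(2 * np.linalg.eigvalsh(R).min(), 1.0) * sd**2

    D = np.diag(s)
    Sigma_inv_D = np.linalg.solve(Sigma, D)
    mean = X - (X - mu) @ Sigma_inv_D
    cov = 2 * D - D @ Sigma_inv_D
    cov = (cov + cov.T) / 2 + 1e-10 * np.eye(p)  # numerical safeguard
    return mean + rng.standard_normal((n, p)) @ np.linalg.cholesky(cov).T
```

When regressors are highly correlated, lambda_min(R) is near zero, s collapses, and the knockoffs become near-copies of the originals, destroying power; this is exactly the impasse that motivates shifting inference to groups of correlated variables.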


RandALO: Out-of-sample risk estimation in no time flat

September 2024 · 1 Read

Estimating out-of-sample risk for models trained on large high-dimensional datasets is an expensive but essential part of the machine learning process, enabling practitioners to optimally tune hyperparameters. Cross-validation (CV) serves as the de facto standard for risk estimation but poorly trades off high bias (K-fold CV) for computational cost (leave-one-out CV). We propose a randomized approximate leave-one-out (RandALO) risk estimator that is not only a consistent estimator of risk in high dimensions but also less computationally expensive than K-fold CV. We support our claims with extensive simulations on synthetic and real data and provide a user-friendly Python package implementing RandALO available on PyPI as randalo and at https://github.com/cvxgrp/randalo.
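
A minimal sketch of the core idea, for ridge regression only and under simplifications of my own: the exact leave-one-out residual is r_i / (1 − H_ii), and the diagonal of the hat matrix H can be estimated with randomized (Hutchinson) probes instead of being formed exactly. The released randalo package handles general models and corrects for probe noise; this sketch does neither.

```python
import numpy as np

def approx_loo_mse_ridge(X, y, lam, n_probes=30, rng=None):
    """Randomized approximate leave-one-out MSE for ridge regression.

    Uses r_i / (1 - H_ii) with H = X (X'X + lam*I)^{-1} X', estimating
    diag(H) via Hutchinson probes: E[z * (Hz)] = diag(H) for Rademacher z.
    """
    rng = np.random.default_rng(rng)
    n, d = X.shape
    A = X.T @ X + lam * np.eye(d)
    beta = np.linalg.solve(A, X.T @ y)
    resid = y - X @ beta

    diag_est = np.zeros(n)
    for _ in range(n_probes):
        z = rng.choice([-1.0, 1.0], size=n)
        diag_est += z * (X @ np.linalg.solve(A, X.T @ z))
    diag_est /= n_probes

    loo_resid = resid / (1.0 - np.clip(diag_est, 0.0, 1.0 - 1e-8))
    return float(np.mean(loo_resid**2))
```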


Can Unconfident LLM Annotations Be Used for Confident Conclusions?

August 2024 · 46 Reads · 1 Citation

Large language models (LLMs) have shown high agreement with human raters across a variety of tasks, demonstrating potential to ease the challenges of human data collection. In computational social science (CSS), researchers are increasingly leveraging LLM annotations to complement slow and expensive human annotations. Still, guidelines for collecting and using LLM annotations, without compromising the validity of downstream conclusions, remain limited. We introduce Confidence-Driven Inference: a method that combines LLM annotations and LLM confidence indicators to strategically select which human annotations should be collected, with the goal of producing accurate statistical estimates and provably valid confidence intervals while reducing the number of human annotations needed. Our approach comes with safeguards against LLM annotations of poor quality, guaranteeing that the conclusions will be both valid and no less accurate than if we only relied on human annotations. We demonstrate the effectiveness of Confidence-Driven Inference over baselines in statistical estimation tasks across three CSS settings--text politeness, stance, and bias--reducing the needed number of human annotations by over 25% in each. Although we use CSS settings for demonstration, Confidence-Driven Inference can be used to estimate most standard quantities across a broad range of NLP problems.
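
The unbiasedness mechanism behind this kind of method can be sketched in a few lines: sampling probabilities favor low-confidence items, and an inverse-probability correction keeps the estimate valid even if the LLM annotations are poor. This conveys only the basic principle under assumptions of my own (a fixed budget, a simple sampling rule, a mean estimand); the paper's selection rule and confidence intervals are more refined, and true_label_fn is a hypothetical stand-in for collecting a human annotation.

```python
import numpy as np

def confidence_driven_mean(llm_labels, llm_conf, budget, true_label_fn, rng=None):
    """Estimate E[Y] from LLM annotations plus a few human annotations.

    Item i is sent to a human with probability pi_i, chosen to be larger
    where the LLM is less confident. The inverse-probability-weighted
    correction term has expectation mean(human - llm), so the estimator
    is unbiased for the human-annotation mean regardless of LLM quality.
    """
    rng = np.random.default_rng(rng)
    llm_labels = np.asarray(llm_labels, dtype=float)
    raw = 1.0 - np.asarray(llm_conf, dtype=float) + 1e-3
    pi = np.clip(budget * raw / raw.sum(), 1e-3, 1.0)

    correction = np.zeros(len(llm_labels))
    for i in np.where(rng.random(len(llm_labels)) < pi)[0]:
        human = true_label_fn(i)  # expensive human annotation, collected on demand
        correction[i] = (human - llm_labels[i]) / pi[i]
    return llm_labels.mean() + correction.mean()
```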


Figure 3. Comparison of the marginal boosting procedure of Stutz et al. [25] (blue) against our conditional boosting method (orange). The left panel shows the prediction sets produced by each method, while the right panel displays the conditional coverage, P(Y_{n+1} ∈ Ĉ(X_{n+1}) | X_{n+1}), against the values of the first feature, X^{(1)}_{n+1}. Using the Adam optimizer with learning rate 0.001, the plotted scores are boosted for 500 steps on a synthetic dataset of size n = 1000 and evaluated on another test dataset of size n = 2000.
Figure 6. Comparison of the split conformal calibration method of Mohri and Hashimoto [19] (blue) against our conditional conformal method (orange). The left and right panels display the miscoverage and the percentage of claims retained by the two methods against the number of views received by the Wikipedia pages in January 2023. The displayed boxplots are computed over 200 trials in which we run both methods on a calibration set of 5890 points and evaluate their coverage on a test set of size 2500. The displayed groups correspond to view counts binned into the intervals (−∞, ∞), [1000000, ∞), [100000, 1000000), [1000, 100000), [100, 1000), and [0, 100).
Figure 7. For a hold-out set of size 424, we plot the optimal level threshold α*_i for claim retention (13) against the number of Wikipedia page views (on the log scale) for the associated person. Letting v_i denote the view count for person i, the black line denotes the estimate of the 0.25-quantile of α*_i | v_i obtained by regressing over the function class {β_0 + Σ_{i=1}^{3} β_i v^i : β ∈ R^4}.
Figure 8. Comparison of the realized and nominal levels of our level-adaptive method for various choices of F. Here, (X_i, Y_i) ∼ N(0, I_2) and we set the conformity score to be simply S(X_i, Y_i) = Y_i. In this experiment, we use our level-adaptive method (Section 3.2 in the main text) to construct prediction sets for Y_i given covariates X_i. We set α(X) = σ(X), where σ(·) denotes the sigmoid function with temperature 1. We then run the level-adaptive method on a calibration set of size n = 1000 and evaluate the method on 1 test point; the plotted points (center, right) are obtained from 500 trials. The left panel shows results for F = {x ↦ β : β ∈ R}, while the right panel displays results for F = {(1, α(X))^⊤ β : β ∈ R^2}.
Large language model validity via enhanced conformal prediction methods

June 2024 · 60 Reads

We develop new conformal inference methods for obtaining validity guarantees on the output of large language models (LLMs). Prior work in conformal language modeling identifies a subset of the text that satisfies a high-probability guarantee of correctness. These methods work by filtering claims from the LLM's original response if a scoring function evaluated on the claim fails to exceed a threshold calibrated via split conformal prediction. Existing methods in this area suffer from two deficiencies. First, the guarantee stated is not conditionally valid. The trustworthiness of the filtering step may vary based on the topic of the response. Second, because the scoring function is imperfect, the filtering step can remove many valuable and accurate claims. We address both of these challenges via two new conformal methods. First, we generalize the conditional conformal procedure of Gibbs et al. (2023) in order to adaptively issue weaker guarantees when they are required to preserve the utility of the output. Second, we show how to systematically improve the quality of the scoring function via a novel algorithm for differentiating through the conditional conformal procedure. We demonstrate the efficacy of our approach on both synthetic and real-world datasets.
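
The split conformal filtering step that this work builds on (and improves) is easy to state. In the baseline sketch below, each calibration response contributes the largest score attached to any of its incorrect claims; thresholding new responses at an empirical quantile of these values guarantees, with probability at least 1 − α, that every retained claim is correct. The data layout is an assumption of mine; the paper's contributions, conditional guarantees and boosted scores, sit on top of this baseline.

```python
import numpy as np

def calibrate_claim_threshold(cal_responses, alpha=0.1):
    """Split conformal threshold for filtering LLM claims.

    cal_responses: list of responses, each a list of (score, is_correct)
    pairs. Retaining claims with score > tau removes every incorrect
    claim iff tau >= the max score among incorrect claims, so that max
    is the response's conformity value.
    """
    taus = np.array([
        max((s for s, ok in claims if not ok), default=-np.inf)
        for claims in cal_responses
    ])
    n = len(taus)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:
        return np.inf  # too little calibration data: retain nothing
    return np.sort(taus)[k - 1]

def filter_claims(scored_claims, tau):
    """Keep only the claims whose score clears the calibrated threshold."""
    return [claim for claim, score in scored_claims if score > tau]
```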


Figure 2: Schematic drawing showing the selection of the number of boosting rounds via cross-validation. Left: we hold out fold j and use the remaining k − 1 folds to generate candidate scores E^(t).
Figure 4: Comparison of test set power (averaged interval length) evaluated on the meps-19 and blog datasets: classical Local and CQR conformal procedures versus the boosted procedures (abbreviated 'Localb' and 'CQRb'), compared in each of the 4 leaves of a regression tree trained on the training set to predict the label Y. A positive log ratio between the regular and boosted interval lengths indicates improvement from boosting. The target miscoverage rate is set at α = 10%.
Table: Test set maximum deviation loss ℓ_M evaluated on various conformalized intervals. The best result achieved for each dataset is highlighted in bold.
Table: Test set power ℓ_P evaluated on various conformalized prediction intervals. The best result achieved for each dataset is highlighted in bold.
Boosted Conformal Prediction Intervals

June 2024 · 56 Reads

This paper introduces a boosted conformal procedure designed to tailor conformalized prediction intervals toward specific desired properties, such as enhanced conditional coverage or reduced interval length. We employ machine learning techniques, notably gradient boosting, to systematically improve upon a predefined conformity score function. This process is guided by carefully constructed loss functions that measure the deviation of prediction intervals from the targeted properties. The procedure operates post-training, relying solely on model predictions and without modifying the trained model (e.g., the deep network). Systematic experiments demonstrate that starting from conventional conformal methods, our boosted procedure achieves substantial improvements in reducing interval length and decreasing deviation from target conditional coverage.
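
To make the setup concrete, here is a strongly simplified sketch in the spirit of the boosted local procedure: a gradient-boosted scale function sharpens the conformity score |y − μ(x)|/σ(x), with the number of rounds chosen on a split of the training data so that the final split conformal step keeps its coverage guarantee. The paper boosts the score function directly with purpose-built losses; the length proxy, round grid, and use of sklearn below are my own simplifications, and mu is assumed to be the frozen model's prediction function.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def boosted_local_conformal(mu, X_tr, y_tr, X_cal, y_cal, X_test, alpha=0.1):
    """Split conformal intervals with a boosted local scale sigma(x)."""
    m = len(y_tr) // 2
    abs_res = np.abs(y_tr - mu(X_tr))
    best_len, best_sig = np.inf, None
    for rounds in (10, 50, 100, 300):  # candidate numbers of boosting rounds
        sig = GradientBoostingRegressor(n_estimators=rounds).fit(X_tr[:m], abs_res[:m])
        s = np.maximum(sig.predict(X_tr[m:]), 1e-6)
        q = np.quantile(abs_res[m:] / s, 1 - alpha)  # interval-length proxy
        if (2 * q * s).mean() < best_len:
            best_len, best_sig = (2 * q * s).mean(), sig
    # Final calibration on held-out data, never touched during boosting.
    s_cal = np.maximum(best_sig.predict(X_cal), 1e-6)
    scores = np.abs(y_cal - mu(X_cal)) / s_cal
    n = len(scores)
    qhat = np.quantile(scores, min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0))
    half = qhat * np.maximum(best_sig.predict(X_test), 1e-6)
    return mu(X_test) - half, mu(X_test) + half
```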


Cross-prediction-powered inference

April 2024 · 19 Reads · 23 Citations

Proceedings of the National Academy of Sciences

While reliable data-driven decision-making hinges on high-quality labeled data, the acquisition of quality labels often involves laborious human annotations or slow and expensive scientific measurements. Machine learning is becoming an appealing alternative as sophisticated predictive techniques are being used to quickly and cheaply produce large amounts of predicted labels; e.g., predicted protein structures are used to supplement experimentally derived structures, predictions of socioeconomic indicators from satellite imagery are used to supplement accurate survey data, and so on. Since predictions are imperfect and potentially biased, this practice brings into question the validity of downstream inferences. We introduce cross-prediction: a method for valid inference powered by machine learning. With a small labeled dataset and a large unlabeled dataset, cross-prediction imputes the missing labels via machine learning and applies a form of debiasing to remedy the prediction inaccuracies. The resulting inferences achieve the desired error probability and are more powerful than those that only leverage the labeled data. Closely related is the recent proposal of prediction-powered inference [A. N. Angelopoulos, S. Bates, C. Fannjiang, M. I. Jordan, T. Zrnic, Science 382, 669–674 (2023)], which assumes that a good pretrained model is already available. We show that cross-prediction is consistently more powerful than an adaptation of prediction-powered inference in which a fraction of the labeled data is split off and used to train the model. Finally, we observe that cross-prediction gives more stable conclusions than its competitors; its confidence intervals typically have significantly lower variability.
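
For the simplest estimand, a population mean, cross-prediction reduces to K-fold cross-fitting plus a debiasing term, as in the sketch below. The learner is a generic sklearn stand-in for whatever model one would actually use, and the confidence intervals that the paper provides are omitted.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold

def cross_prediction_mean(X_lab, y_lab, X_unlab, K=10):
    """Cross-prediction point estimate of E[Y].

    Each fold's model is trained on the other folds, used to impute
    labels on the unlabeled data, and corrected by its held-out labeled
    residuals; the correction removes the bias of imperfect predictions.
    """
    n = len(y_lab)
    imputed, debias = [], np.zeros(n)
    for fit_idx, hold_idx in KFold(n_splits=K, shuffle=True, random_state=0).split(X_lab):
        model = GradientBoostingRegressor().fit(X_lab[fit_idx], y_lab[fit_idx])
        imputed.append(model.predict(X_unlab).mean())
        debias[hold_idx] = y_lab[hold_idx] - model.predict(X_lab[hold_idx])
    return float(np.mean(imputed) + debias.mean())
```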


Citations (43)


... , c), c > 0, which corresponds to a symmetric simplex, the random vector X chosen uniformly from K_n does not consist of finitely exchangeable random variables. Recently the notion of weighted exchangeability has been introduced in the infinite setting in [1], and has been studied in the finite setting in [31]. Briefly speaking, weighted exchangeable random variables are classically exchangeable after a change of variables. ...

Reference:

Limit Theorems Under Several Linear Constraints
De Finetti’s theorem and related results for infinite weighted exchangeable sequences
  • Citing Article
  • November 2024

Bernoulli

... It is also possible to work top-down, by first identifying a set of psychological features in data (i.e., through manual annotation) and then characterizing the consistent language features. This approach has a long history and is gaining traction through recent advances in explainable AI, where feature inputs are assessed for their contributions to otherwise opaque models (e.g., transformer-based neural networks like BERT [222,223]), and in generative AI, which enables computers to do qualitative labeling instead of humans [192,224]. Computer annotation of emotion, for example, is mostly of similar quality to human annotation [193,225], with the major caveat that models are primarily trained on US English texts, and so give annotations consistent with an average American [226,227]. ...

Can Unconfident LLM Annotations Be Used for Confident Conclusions?

... Recent research has focused on developing general-purpose frameworks for asymptotically valid inference with flexible function approximators, particularly black-box AI such as large language models (LLMs). Notable contributions in this area include "prediction-powered inference" (Angelopoulos et al., 2023, 2024; Zrnic and Candès, 2024a,b; Ji et al., 2025; Kluger et al., 2025), "design-based supervised learning" (Egami et al., 2023, 2024), "flexible regression adjustment" (List et al., 2024), and the applied econometric framework of Ludwig et al. (2024). These works depart from semiparametric theory, in stark contrast to Chernozhukov et al. (2018), which centers it. ...

Cross-prediction-powered inference
  • Citing Article
  • April 2024

Proceedings of the National Academy of Sciences

... The proposed spectral and aggregation function parameters are trained using optimization methods through SURE. The threshold parameter in the soft-thresholding step is adjusted to minimize the computed SURE value [20]. This involves an iterative optimization process for the proposed network. ...

Tractable Evaluation of Stein's Unbiased Risk Estimate With Convex Regularizers
  • Citing Article
  • January 2023

IEEE Transactions on Signal Processing

... Finally, Ramdas et al. [40] established the validity of a wide variety of generalized permutation tests (involving arbitrary subsets of the permutation distribution and nonuniform distributions over them). However, their work does not discuss power or computational complexity, two matters of primary importance in the present work. ...

Permutation Tests Using Arbitrary Permutation Distributions

Sankhya A

... The works most closely related to our own concern the construction of predictive lower bounds for survival times at a fixed confidence level α ∈ (0, 1), using censored data. These methods seek to deliver distribution-free prediction intervals while correcting for the non-exchangeability (Barber et al., 2023) introduced by censoring. ...

Conformal prediction beyond exchangeability
  • Citing Article
  • April 2023

The Annals of Statistics

... Indeed, there are several such tools that augment p-values with side-information [25, 33–36, 50]. When it comes to using side-information in the competition framework, we are aware of only one method with proper error-rate control along with a precise implementation, namely the recently published Adaptive Knockoffs [43] (AdaPT [33] offers a meta-procedure, but without a precise implementation). ...

Knockoffs with side information
  • Citing Article
  • June 2023

The Annals of Applied Statistics

... This framework is very relevant for online data collection in the tech setting, including the setting discussed by Johari et al. [2022]. E-values also prove to be useful in universal inference for irregular parametric models [Wasserman et al., 2020; Tse and Davison, 2022; Spector et al., 2023; Park et al., 2023]. In general, e-values offer more robustness for valid inference in the presence of various concerns about data-dependent experimental decisions, such as a post-hoc choice of α [Grünwald, 2024; Hemerik and Koning, 2024]. ...

A Discussion of Tse and Davison (2022) “A Note on Universal Inference”
  • Citing Article
  • April 2023

Stat

... Given a testing query that uses LQO to generate its plan, with the constructed and predicted partial plans realized at step t, an STL constraint φ, a robust semantics measure ρ(·) for this constraint, and a constraint violation probability δ ∈ [0, 1], we can guarantee that the constructed and predicted partial plans so far will result in a complete plan that satisfies the constraint with confidence level 1 − δ, i.e., Prob(π ⊨ φ) ≥ 1 − δ, only if the robust semantics satisfies the calibrated conformal bound. (Note that upper bounds on the robustness values can be adjusted in the distribution-shift cases using the same approach in Section 3.2.) Figure 4: CP-based Runtime Verification Framework. ...

Testing for outliers with conformal p-values
  • Citing Article
  • February 2023

The Annals of Statistics

... Assumption 1 is widely adopted in the literature of causal inference with unmeasured confounders (Jin et al., 2023; Yadlowsky et al., 2022) and, as pointed out by Yadlowsky et al. (2022), such a latent confounder vector H should almost always exist. When unconfoundedness holds, conditioning on X alone is sufficient, and the latent confounder vector H can be set to be an empty set, meaning conditional randomness reduces to unconfoundedness. ...

Sensitivity analysis of individual treatment effects: A robust conformal inference approach

Proceedings of the National Academy of Sciences