Martin J. Wainwright’s research while affiliated with University of California, Berkeley and other places


Publications (341)


Inference under Staggered Adoption: Case Study of the Affordable Care Act
  • Preprint
  • File available

December 2024 · 16 Reads

Eric Xia · Yuling Yan · Martin J. Wainwright

Panel data consists of a collection of N units that are observed over T time periods. A policy or treatment is subject to staggered adoption if different units adopt the treatment at different times and, once treated, remain treated thereafter (or are never treated at all). Assessing the effectiveness of such a policy requires estimating the treatment effect, corresponding to the difference between outcomes for treated versus untreated units. We develop inference procedures that build upon a computationally efficient matrix estimator for treatment effects in panel data. Our routines return confidence intervals (CIs), both for individual treatment effects and for more general bilinear functionals of treatment effects, with prescribed coverage guarantees. We apply these inferential methods to analyze the effectiveness of the Medicaid expansion portion of the Affordable Care Act. Based on our analysis, Medicaid expansion has led to substantial reductions in uninsurance rates, has reduced infant mortality rates, and has had no significant effect on healthcare expenditures.
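To make the estimand concrete, here is a minimal, hedged Python sketch of treatment-effect estimates in a staggered-adoption panel, where counterfactual outcomes for treated cells are imputed from a simple unit-plus-time mean model fit on never-treated units. This is an illustrative baseline only, not the paper's computationally efficient matrix estimator or its confidence-interval construction; all function and variable names are hypothetical.

```python
# Illustrative baseline (not the paper's estimator): per-cell treatment-effect
# estimates in a staggered-adoption panel, imputing counterfactuals from
# never-treated units via a unit-offset plus common-time-profile model.
import numpy as np

def staggered_effects(Y, adopt_time):
    """Y: (N, T) outcome panel; adopt_time[i] = first treated period (int) or None."""
    N, T = Y.shape
    never = [i for i in range(N) if adopt_time[i] is None]
    time_mean = Y[never].mean(axis=0)          # common time profile from never-treated units
    effects = {}
    for i, t0 in enumerate(adopt_time):
        if t0 is None or t0 == 0:
            continue                           # need at least one pre-treatment period
        # Unit-level offset estimated from the pre-treatment window.
        unit_offset = Y[i, :t0].mean() - time_mean[:t0].mean()
        for t in range(t0, T):
            counterfactual = unit_offset + time_mean[t]
            effects[(i, t)] = Y[i, t] - counterfactual
    return effects
```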


[Figure 3: Predicting subsequent cardiovascular risk after heart attacks. Test-set accuracy of binary classifiers trained with varying fractions of labeled responses; the naive procedure (blue) uses only the labeled data points, while the other two curves are PAST with either hard labels (green) or stochastic soft labels (orange). Error bars are computed by re-running the estimator on training sets with randomly chosen labeled subsets.]
[Figure 4: Example chest X-rays. (a) A healthy chest X-ray. (b) A chest X-ray of a patient with pneumonia; arrows point to the white region corresponding to fluid-filled air sacs in the lungs.]
[Table: Forecasting societal ills, empirical results (R²).]
Prediction Aided by Surrogate Training

December 2024 · 5 Reads

We study a class of prediction problems in which relatively few observations have associated responses, but all observations include both standard covariates as well as additional "helper" covariates. While the end goal is to make high-quality predictions using only the standard covariates, helper covariates can be exploited during training to improve prediction. Helper covariates arise in many applications, including forecasting in time series; incorporation of biased or mis-calibrated predictions from foundation models; and sharing information in transfer learning. We propose "prediction aided by surrogate training" (PAST), a class of methods that exploit labeled data to construct a response estimator based on both the standard and helper covariates, and then use the full dataset with pseudo-responses to train a predictor based only on standard covariates. We establish guarantees on the prediction error of this procedure, with the response estimator allowed to be constructed in an arbitrary way, and the final predictor fit by empirical risk minimization over an arbitrary function class. These upper bounds involve the risk associated with the oracle dataset (all responses available), plus an overhead that measures the accuracy of the pseudo-responses. This theory characterizes both regimes in which PAST accuracy is comparable to the oracle accuracy, as well as more challenging regimes where it behaves poorly. We demonstrate its empirical performance across a range of applications, including forecasting of societal ills over time with future covariates as helpers; prediction of cardiovascular risk after heart attacks with prescription data as helpers; and diagnosing pneumonia from chest X-rays using machine-generated predictions as helpers.
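The two-stage recipe described in the abstract lends itself to a short sketch. The following is a hedged Python illustration assuming a scikit-learn setup: stage one fits a response estimator on the labeled rows using both standard and helper covariates, and stage two fits the final predictor on standard covariates only, using pseudo-responses for the full dataset. The specific models (gradient boosting, ridge) and all names are illustrative choices, not the paper's prescription.

```python
# Hedged sketch of the two-stage PAST recipe described in the abstract.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Ridge

def past_fit(X_std, X_help, y, labeled_mask):
    """X_std: standard covariates; X_help: helper covariates (available for all
    rows); y: responses, trusted only where labeled_mask is True."""
    # Stage 1: response estimator using BOTH covariate blocks, labeled rows only.
    Z = np.hstack([X_std, X_help])
    stage1 = GradientBoostingRegressor().fit(Z[labeled_mask], y[labeled_mask])
    # Pseudo-responses for every observation (labeled or not).
    y_pseudo = stage1.predict(Z)
    # Stage 2: empirical risk minimization on standard covariates only.
    stage2 = Ridge().fit(X_std, y_pseudo)
    return stage2   # predicts from standard covariates alone at test time

# Usage (hypothetical data): model = past_fit(X_std, X_help, y, labeled)
#                            y_hat = model.predict(X_std_test)
```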



When is it worthwhile to jackknife? Breaking the quadratic barrier for Z-estimators

November 2024 · 3 Reads

Licong Lin · Fangzhou Su · [...] · Martin Wainwright

Resampling methods are especially well-suited to inference with estimators that provide only "black-box" access. The jackknife is a form of resampling, widely used for bias correction and variance estimation, that is well understood under classical scaling where the sample size n grows for a fixed problem. We study its behavior in application to estimating functionals using high-dimensional Z-estimators, allowing both the sample size n and problem dimension d to diverge. We begin by showing that the plug-in estimator based on the Z-estimate suffers from a quadratic breakdown: while it is \sqrt{n}-consistent and asymptotically normal whenever n \gtrsim d^2, it fails for a broad class of problems whenever n \lesssim d^2. We then show that under suitable regularity conditions, applying a jackknife correction yields an estimate that is \sqrt{n}-consistent and asymptotically normal whenever n \gtrsim d^{3/2}. This provides strong motivation for the use of the jackknife in high-dimensional problems where the dimension is moderate relative to sample size. We illustrate consequences of our general theory for various specific Z-estimators, including non-linear functionals in linear models; generalized linear models; and the inverse propensity score weighting (IPW) estimate for the average treatment effect, among others.
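For readers unfamiliar with the correction being analyzed, here is a generic delete-one jackknife bias correction for a black-box estimator, written as a hedged Python sketch. It shows the resampling scheme itself, not the paper's Z-estimator constructions or its high-dimensional guarantees; names are illustrative.

```python
# Generic delete-one jackknife: bias correction and variance estimate for a
# black-box estimator applied to an (n, d) data array.
import numpy as np

def jackknife_correct(estimator, data):
    """estimator: callable mapping an (n, d) array to a scalar estimate."""
    n = len(data)
    theta_full = estimator(data)
    # Leave-one-out estimates.
    loo = np.array([estimator(np.delete(data, i, axis=0)) for i in range(n)])
    bias_hat = (n - 1) * (loo.mean() - theta_full)
    theta_jack = theta_full - bias_hat                        # bias-corrected estimate
    var_hat = (n - 1) / n * np.sum((loo - loo.mean()) ** 2)   # jackknife variance estimate
    return theta_jack, var_hat
```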


Instrumental variables: A non-asymptotic viewpoint

October 2024 · 44 Reads

We provide a non-asymptotic analysis of the linear instrumental variable estimator allowing for the presence of exogenous covariates. In addition, we introduce a novel measure of the strength of an instrument that can be used to derive non-asymptotic confidence intervals. For strong instruments, these non-asymptotic intervals match the asymptotic ones exactly up to higher-order corrections; for weaker instruments, our intervals involve adaptive adjustments to the instrument strength, and thus remain valid even when asymptotic predictions break down. We illustrate our results via an analysis of the effect of PM2.5 pollution on various health conditions, using wildfire smoke exposure as an instrument. Our analysis shows that exposure to PM2.5 pollution leads to statistically significant increases in the incidence of health conditions such as asthma, heart disease, and strokes.
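As a point of reference, the classical two-stage least squares (2SLS) point estimate for a linear IV model with exogenous covariates can be written in a few lines. The sketch below is a hedged illustration with hypothetical names; it does not reproduce the paper's non-asymptotic confidence intervals or its instrument-strength measure.

```python
# Plain 2SLS point estimate for a linear IV model with exogenous covariates.
import numpy as np

def tsls(y, D, Z, X):
    """y: (n,) outcome; D: (n, p) endogenous regressors; Z: (n, k) instruments
    (k >= p); X: (n, q) exogenous covariates (include a column of ones)."""
    W = np.hstack([Z, X])                      # full instrument set
    R = np.hstack([D, X])                      # full regressor set
    P_W = W @ np.linalg.pinv(W.T @ W) @ W.T    # projection onto the instrument span
    beta = np.linalg.solve(R.T @ P_W @ R, R.T @ P_W @ y)
    return beta[:D.shape[1]]                   # coefficients on the endogenous block
```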



Optimal and instance-dependent guarantees for Markovian linear stochastic approximation

March 2024 · 4 Reads · 15 Citations

Mathematical Statistics and Learning

We study stochastic approximation procedures for approximately solving a d-dimensional linear fixed-point equation based on observing a trajectory of length n from an ergodic Markov chain. We first exhibit a non-asymptotic bound of the order t_{\mathrm{mix}} \frac{d}{n} on the squared error of the last iterate of a standard scheme, where t_{\mathrm{mix}} is a mixing time. We then prove a non-asymptotic instance-dependent bound on a suitably averaged sequence of iterates, with a leading term that matches the local asymptotic minimax limit, including sharp dependence on the parameters (d, t_{\mathrm{mix}}) in the higher-order terms. We complement these upper bounds with a non-asymptotic minimax lower bound that establishes the instance-optimality of the averaged SA estimator. We derive corollaries of these results for policy evaluation with Markov noise, covering the \mathrm{TD}(\lambda) family of algorithms for all \lambda \in [0, 1), and for linear autoregressive models. Our instance-dependent characterizations open the door to the design of fine-grained model selection procedures for hyperparameter tuning (e.g., choosing the value of \lambda when running the \mathrm{TD}(\lambda) algorithm).
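A hedged sketch of the estimator being analyzed: averaged linear stochastic approximation applied to noisy observations (A_t, b_t) of a linear fixed-point equation A θ = b along a trajectory (TD(0) with linear function approximation is one such instance). The step-size schedule and the interface below are illustrative assumptions, not the paper's prescription.

```python
# Averaged linear stochastic approximation for a linear fixed-point equation
# A theta = b, observed through a stream of noisy pairs (A_t, b_t).
import numpy as np

def averaged_linear_sa(observations, d, step=lambda t: 1.0 / (1.0 + 0.1 * t)):
    """observations: iterable of (A_t, b_t) pairs with A_t (d, d) and b_t (d,)."""
    theta = np.zeros(d)
    theta_bar = np.zeros(d)
    for t, (A_t, b_t) in enumerate(observations, start=1):
        theta = theta + step(t) * (b_t - A_t @ theta)   # SA update on the last iterate
        theta_bar += (theta - theta_bar) / t            # running Polyak-Ruppert average
    return theta_bar
```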


Instance-optimality in optimal value estimation: Adaptivity via variance-reduced Q-learning

January 2024 · 17 Reads · 10 Citations

IEEE Transactions on Information Theory

Various algorithms in reinforcement learning exhibit dramatic variability in their convergence rates and ultimate accuracy as a function of the problem structure. Such instance-specific behavior is not captured by existing global minimax bounds, which are worst-case in nature. We analyze the problem of estimating optimal Q-value functions (state-action value functions) for a discounted Markov decision process with discrete states and actions; our main result is to identify an instance-dependent functional that controls the difficulty of estimation in the ℓ∞-norm. Using a local minimax framework, we show that this functional arises in lower bounds on the accuracy of any estimation procedure. We establish the sharpness of these lower bounds, up to factors logarithmic in the state and action spaces, by analyzing a variance-reduced version of Q-learning. Our theory provides a precise way of distinguishing "easy" problems from "hard" ones in the context of Q-learning, as illustrated by an ensemble with a continuum of difficulty.
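The following is a hedged Python sketch of one epoch of variance-reduced Q-learning in a generative-model setting: updates are recentered around a reference point Q_bar using a Monte Carlo estimate of its Bellman operator. The epoch structure, step-size schedule, and sampler interface are illustrative assumptions rather than the paper's exact algorithm.

```python
# One epoch of a variance-reduced Q-learning scheme (illustrative sketch):
# single-sample Bellman updates recentered around a reference Q_bar.
import numpy as np

def vr_q_epoch(sample_next, Q, Q_bar, gamma, rewards, n_recenter, n_steps):
    """sample_next(s, a) -> next-state index; rewards: (S, A) array of mean rewards."""
    S, A = rewards.shape
    # Monte Carlo estimate of the Bellman operator applied to the reference Q_bar.
    T_bar = np.zeros((S, A))
    for _ in range(n_recenter):
        for s in range(S):
            for a in range(A):
                T_bar[s, a] += rewards[s, a] + gamma * Q_bar[sample_next(s, a)].max()
    T_bar /= n_recenter
    # Recentered stochastic updates.
    for k in range(1, n_steps + 1):
        eta = 1.0 / (1.0 + (1 - gamma) * k)    # illustrative step-size choice
        for s in range(S):
            for a in range(A):
                s_next = sample_next(s, a)
                target = (gamma * (Q[s_next].max() - Q_bar[s_next].max())
                          + T_bar[s, a])
                Q[s, a] = (1 - eta) * Q[s, a] + eta * target
    return Q
```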



When is the estimated propensity score better? High-dimensional analysis and bias correction

March 2023 · 22 Reads

Anecdotally, using an estimated propensity score is superior to using the true propensity score in estimating the average treatment effect based on observational data. However, this claim comes with several qualifications: it holds only if the propensity score model is correctly specified and the number of covariates d is small relative to the sample size n. We revisit this phenomenon by studying the inverse propensity score weighting (IPW) estimator based on a logistic model with a diverging number of covariates. We first show that the IPW estimator based on the estimated propensity score is consistent and asymptotically normal with smaller variance than the oracle IPW estimator (using the true propensity score) if and only if n \gtrsim d^2. We then propose a debiased IPW estimator that achieves the same guarantees in the regime n \gtrsim d^{3/2}. Our proofs rely on a novel non-asymptotic decomposition of the IPW error along with careful control of the higher-order terms.
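For concreteness, here is a hedged sketch of the IPW estimator discussed in the abstract, with the propensity score fit by unregularized logistic regression. The paper's debiased correction for the n \gtrsim d^{3/2} regime is not reproduced; the sketch assumes a recent scikit-learn (which accepts penalty=None), and all names are illustrative.

```python
# IPW estimate of the average treatment effect with an estimated (logistic)
# propensity score, i.e., the "estimated propensity score" version above.
import numpy as np
from sklearn.linear_model import LogisticRegression

def ipw_ate(X, treatment, y):
    """X: (n, d) covariates; treatment: (n,) in {0, 1}; y: (n,) outcomes."""
    # Plain maximum-likelihood logistic fit (penalty=None needs scikit-learn >= 1.2).
    model = LogisticRegression(penalty=None, max_iter=1000).fit(X, treatment)
    pi_hat = model.predict_proba(X)[:, 1]           # estimated propensity scores
    w_treated = treatment / pi_hat
    w_control = (1 - treatment) / (1 - pi_hat)
    return np.mean(w_treated * y) - np.mean(w_control * y)
```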


Citations (47)


... The proof of this result is provided in Appendix A. This result offers a broad generalization of [42, Equation (19b)] to any risk with Lipschitz-continuous sub-exponential gradients over any convex and compact set. Our result is comparable to the O(\sqrt{\ell_i m}) rate that can be found for specific problem instances such as linear least squares regression and logistic regression, but with the addition of a \sqrt{\log m} factor. ...

Reference:

Solving Decision-Dependent Games by Learning From Feedback
A Diffusion Process Perspective on Posterior Contraction Rates for Parameters
  • Citing Article
  • June 2024

SIAM Journal on Mathematics of Data Science

... Consequently, we are unable to discuss DET-POMDP within the context of any of these existing categories. In addition to inconsistent environmental complexity [10], observability plays a significant role in distinguishing between difficult and simple decision problems. ...

Instance-optimality in optimal value estimation: Adaptivity via variance-reduced Q-learning
  • Citing Article
  • January 2024

IEEE Transactions on Information Theory

... For finite-time analysis, the properties of the fixed-point operator and the nature of the noise sequence play crucial roles. In particular, when the operator is linear or contractive with respect to some norm, and the noise process is either i.i.d. or forms a uniformly ergodic Markov chain, there is a rich body of literature establishing mean-square and high-probability bounds [9,21,22,26,32,54,55,59,60,65,70]. Beyond these standard settings, finite-time analysis of SA with seminorm contractive operators and non-expansive operators have been developed recently in [24] and [10,14], respectively. ...

Optimal and instance-dependent guarantees for Markovian linear stochastic approximation
  • Citing Article
  • March 2024

Mathematical Statistics and Learning

... Covariate shift [6,13,25,14], as discussed in this paper, refers to a scenario where the distribution of input data changes between the training and test phases of a machine learning model. Specifically, this means that while the distribution of the input variables (covariates) varies, the conditional distribution of the output variable given these inputs remains the same. ...

Optimally tackling covariate shift in RKHS-based nonparametric regression
  • Citing Article
  • April 2023

The Annals of Statistics

... Besides the realizability of Q^π, the guarantee also depends on the invertibility of A, which can be viewed as a coverage condition, since A changes with the data distribution µ [AJX20; AJS23; JX24]. In fact, in the on-policy setting (µ is an invariant distribution under π), σ_min(A) can be shown to be lower-bounded away from 0 [MPW23]. ...

Optimal Oracle Inequalities for Projected Fixed-Point Equations, with Applications to Policy Evaluation
  • Citing Article
  • December 2022

Mathematics of Operations Research

... For the case of superlinearly growing drift f, the dimension dependence of the PLMC algorithm (2.17) is proved to be d^{\max\{3\gamma/2,\, 2\gamma-1\}}, which is new in the literature. More precisely, the dimension dependence is d^{3\gamma/2} for the case 1 ≤ γ ≤ 2 and d^{2\gamma-1} for γ > 2. Whether the obtained dimension dependence can be further improved turns out to be an interesting topic [24] and would be a direction for our future work. ...

Improved bounds for discretization of Langevin diffusions: Near-optimal rates without convexity
  • Citing Article
  • August 2022

Bernoulli

... Estimating a linear functional of the treatment effect is of great importance in both the literature of causal inference and reinforcement learning (RL). For instance, in causal inference, one is interested in estimating the average treatment effect (ATE) [20] or their weighted variants, and in the bandits and RL literature, one is interested in estimating the expected reward of a target policy [38, 64,41,37]. Two main challenges arise when tackling this problem: ...

Minimax Off-Policy Evaluation for Multi-Armed Bandits
  • Citing Article
  • August 2022

IEEE Transactions on Information Theory

... It turns out that the problem of estimating the value function corresponding to a given policy (policy evaluation), and that of finding the optimal policy (control), can both be cast as instances of SA [4]. In this context, the sample complexities of SA-based RL algorithms like TD learning and Q-learning have been studied recently in [20][21][22][23][24][25][26]. This line of work has collectively revealed that for contemporary RL problems with large state and action spaces, a large number of samples is typically needed to achieve a desired performance accuracy. ...

Is Temporal Difference Learning Optimal? An Instance-Dependent Analysis
  • Citing Article
  • October 2021

SIAM Journal on Mathematics of Data Science

... Policy-based algorithms directly learn the policy without computing value functions. Notable examples include proximal policy optimization (PPO) [28] and actor-critic (A2C) [29]. PPO maximizes expected cumulative rewards by optimizing the policy while maintaining a balance between exploration and exploitation. ...

Provable Benefits of Actor-Critic Methods for Offline Reinforcement Learning
  • Citing Preprint
  • August 2021

... To address this issue, there has been a recent interest in the class of permutation-based models [6,24,25,17,8,12,21,27], where it is only assumed that the matrix M satisfies some shape-constrained conditions before one (or two) permutations act on the rows (and possibly the columns) of M. Quite surprisingly, it has been established in [25] that, at least in some settings, the matrix M can be estimated at the same rate in those non-parametric models as in classical parametric models by relying on the least-squares estimator on the class of permuted bi-isotonic matrices. ...

Towards optimal estimation of bivariate isotonic matrices with unknown permutations
  • Citing Article
  • December 2020

The Annals of Statistics