Giles Hooker’s research while affiliated with University of Pennsylvania and other places


Publications (168)


CLaRe: Compact near-lossless Latent Representations of High-Dimensional Object Data
  • Preprint
  • File available

February 2025 · 10 Reads

Emma Zohner · [...] · Giles Hooker · Jeffrey Morris

Latent feature representation methods play an important role in the dimension reduction and statistical modeling of high-dimensional complex data objects. However, existing approaches to assess the quality of these methods often rely on aggregated statistics that reflect the central tendency of the distribution of information losses, such as average or total loss, which can mask variation across individual observations. We argue that controlling average performance is insufficient to guarantee that statistical analysis in the latent space reflects the data-generating process and instead advocate for controlling the worst-case generalization error, or a tail quantile of the generalization error distribution. Our framework, CLaRe (Compact near-lossless Latent Representations), introduces a systematic way to balance compactness of the representation with preservation of information when assessing and selecting among latent feature representation methods. To facilitate the application of the CLaRe framework, we have developed GLaRe (Graphical Analysis of Latent Representations), an open-source R package that implements the framework and provides graphical summaries of the full generalization error distribution. We demonstrate the utility of CLaRe through three case studies on high-dimensional datasets from diverse fields of application. We apply the CLaRe framework to select among principal components, wavelets and autoencoder representations for each dataset. The case studies reveal that the optimal latent feature representation varies depending on dataset characteristics, emphasizing the importance of a flexible evaluation framework.
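The framework's central quantity is the distribution of per-observation information loss, assessed through its tail rather than its mean. The accompanying GLaRe implementation is an R package; purely as an illustration of the idea (not of the package), here is a minimal Python sketch with made-up data contrasting the average and a tail quantile of per-observation reconstruction error for PCA representations of increasing size.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 100))                   # toy high-dimensional data (hypothetical)
X[:25] += rng.normal(scale=3.0, size=(25, 100))   # a few atypical observations

def per_obs_error(X, k):
    """Relative reconstruction error of each observation under a k-component PCA."""
    pca = PCA(n_components=k).fit(X)
    X_hat = pca.inverse_transform(pca.transform(X))
    return np.sum((X - X_hat) ** 2, axis=1) / np.sum(X ** 2, axis=1)

for k in (5, 20, 50):
    err = per_obs_error(X, k)
    # Average loss can look acceptable while the tail (worst cases) remains large.
    print(f"k={k:2d}  mean={err.mean():.3f}  95th percentile={np.quantile(err, 0.95):.3f}")
```

Selecting the representation by the 95th percentile rather than the mean is what makes the chosen compression "near-lossless" for essentially all observations, not just on average.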


Fig. 1. Plots for estimating demographic parity in simulation Setting 1. Each point represents an estimate from a simulated dataset. In both plots, the X-axis is sample size and the Y-axis is the estimated demographic parity. The dotted line represents the true value of demographic parity for the distribution. For the lower plot, the bands around each point represent the 95% confidence interval.
Fig. 2. Heatmap demonstrating coverage in different scenarios and as sample size varies. For each cell in the heatmap, coverage is calculated over 100 simulations.
Fig. 3. Line plot comparing targeted learning estimate of variance to a t-test estimate of variance.
Fig. 4. Line plot and heatmap demonstrating error and coverage results for CMI. In the line plots, the X-axis is the value of c and the Y-axis is the error. The solid horizontal line represents an error of 0. For every combination of (c, estimator type, sample size), we perform 100 simulations. The heatmap shows coverage for the TL estimator as c and sample size vary.
Fig. 5. Feature importance to fairness scores for both the Adult and Law school datasets. The first row contains feature importances for Adult, and the second row contains feature importances for Law school.
Targeted Learning for Data Fairness

February 2025 · 8 Reads

Data and algorithms have the potential to produce and perpetuate discrimination and disparate treatment. As such, significant effort has been invested in developing approaches to defining, detecting, and eliminating unfair outcomes in algorithms. In this paper, we focus on performing statistical inference for fairness. Prior work in fairness inference has largely focused on inferring the fairness properties of a given predictive algorithm. Here, we expand fairness inference by evaluating fairness in the data-generating process itself, referred to here as data fairness. We perform inference on data fairness using targeted learning, a flexible framework for nonparametric inference. We derive estimators for demographic parity, equal opportunity, and conditional mutual information. Additionally, we find that our estimators for probabilistic metrics exploit double robustness. To validate our approach, we perform several simulations and apply our estimators to real data.
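For context, demographic parity in the data-generating process can be written as DP = P(Y = 1 | A = 1) - P(Y = 1 | A = 0). The sketch below shows only a naive plug-in estimate with a Wald-style interval on simulated data (the variables and generating model are invented for illustration); the paper's targeted learning estimators instead combine flexible nuisance estimation with an influence-function-based update, which is what yields the double robustness and valid inference described above.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
a = rng.integers(0, 2, size=n)            # sensitive attribute (hypothetical)
y = rng.binomial(1, 0.35 + 0.10 * a)      # outcome whose rate depends on the attribute

# Plug-in estimate of demographic parity: P(Y=1 | A=1) - P(Y=1 | A=0)
p1, p0 = y[a == 1].mean(), y[a == 0].mean()
dp_hat = p1 - p0

# Naive Wald-style standard error for a difference of two proportions.
se = np.sqrt(p1 * (1 - p1) / (a == 1).sum() + p0 * (1 - p0) / (a == 0).sum())
print(f"DP estimate: {dp_hat:.3f}  (95% CI: {dp_hat - 1.96*se:.3f}, {dp_hat + 1.96*se:.3f})")
```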


Figure 1: Testing the winner and runner-up fails in the unequal variances case. Wide distribution rescaled for ease of interpretation.
Figure 2: Tests to verify winner. Shaded region depicts truncation event; p-value is mass of segments B/(A + B).
Gaussian Rank Verification

January 2025 · 4 Reads

Statistical experiments often seek to identify random variables with the largest population means. This inferential task, known as rank verification, has been well studied on Gaussian data with equal variances. This work provides the first treatment of the unequal-variances case, utilizing ideas from the selective inference literature. We design a hypothesis test that verifies the rank of the largest observed value without losing power to multiple testing corrections. This test is subsequently extended to two procedures: identifying some number of correctly ordered Gaussian means, and validating the top-K set. The testing procedures are validated on NHANES survey data.
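As a point of reference, the sketch below performs only the naive comparison of the observed winner against the runner-up using a one-sided z-test with unequal, assumed-known variances (all numbers are hypothetical). This is not the paper's procedure: as Fig. 1 notes, the naive comparison fails in the unequal-variances case, and the selective-inference test instead conditions on the event that this particular group was observed to be largest, yielding a truncated-Gaussian p-value that remains valid without a multiple testing correction.

```python
import numpy as np
from scipy.stats import norm

# Observed group estimates and their (known, unequal) standard errors -- hypothetical values.
est = np.array([2.10, 1.95, 1.40, 0.80])
se  = np.array([0.30, 0.05, 0.20, 0.10])

order = np.argsort(est)[::-1]
winner, runner_up = order[0], order[1]

# Naive (non-selective) one-sided z-test that the winner's mean exceeds the runner-up's.
z = (est[winner] - est[runner_up]) / np.sqrt(se[winner] ** 2 + se[runner_up] ** 2)
p_naive = norm.sf(z)
print(f"winner = group {winner}, naive one-sided p = {p_naive:.3f}")
# The paper's test conditions on the selection event (a truncated Gaussian) rather than
# treating the winner as fixed in advance.
```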


Targeted Maximum Likelihood Estimation for Integral Projection Models in Population Ecology

November 2024 · 14 Reads

Integral projection models (IPMs) are widely used to study population growth and the dynamics of demographic structure (e.g. age and size distributions) within a population. These models use data on individuals' growth, survival, and reproduction to predict changes in the population from one time point to the next, and use these in turn to ask about long-term growth rates, the sensitivity of that growth rate to environmental factors, and aspects of the long-term population such as how much reproduction is concentrated in a few individuals; these quantities are not directly measurable from data and must be inferred from the model. Building IPMs requires us to develop models for individual fates over the next time step -- Did they survive? How much did they grow or shrink? Did they reproduce? -- conditional on their initial state as well as on environmental covariates, in a manner that accounts for the unobservable quantities that we are ultimately interested in estimating. Targeted maximum likelihood estimation (TMLE) methods are particularly well-suited to a framework in which we are largely interested in the consequences of models. These build machine learning-based models that estimate the probability distribution of the data we observe and define a target of inference as a function of these. The initial estimate of the distribution is then modified by tilting in the direction of the efficient influence function, both to de-bias the parameter estimate and to provide more accurate inference. In this paper, we employ TMLE to develop robust and efficient estimators for properties derived from a fitted IPM. Mathematically, we derive the efficient influence function and formulate the paths for the least favorable sub-models. Empirically, we conduct extensive simulations using real data from both long-term studies of Idaho steppe plant communities and experimental Rotifer populations.
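For readers unfamiliar with the targeting step, the generic TMLE recipe (sketched here in general form under standard notation, not the paper's IPM-specific derivation) starts from an initial estimate of the data distribution, defines the target as a functional of that estimate, and then tilts the estimate along a least favorable submodel indexed by the efficient influence function D*:

```latex
% Generic TMLE sketch; notation is assumed, not taken from the paper.
% One-step (debiased) estimator, for comparison:
\hat\psi_{\text{one-step}} \;=\; \psi(\hat P_n) \;+\; \frac{1}{n}\sum_{i=1}^n D^*(\hat P_n)(O_i)

% Least favorable submodel through \hat P_n, with fluctuation parameter \epsilon:
p_{n,\epsilon}(o) \;\propto\; \bigl(1 + \epsilon\, D^*(\hat P_n)(o)\bigr)\, \hat p_n(o),
\qquad
\hat\epsilon \;=\; \arg\max_{\epsilon} \frac{1}{n}\sum_{i=1}^n \log p_{n,\epsilon}(O_i)

% Iterate the update until the empirical mean of the EIF is (approximately) zero,
% then report the plug-in at the targeted fit:
\frac{1}{n}\sum_{i=1}^n D^*(\hat P_n^*)(O_i) \approx 0,
\qquad
\hat\psi_{\text{TMLE}} \;=\; \psi(\hat P_n^*)
```

The tilt removes the first-order bias of the plug-in estimate, so that EIF-based variance estimates give asymptotically valid confidence intervals for the IPM-derived quantities.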


Targeted Learning for Variable Importance

November 2024 · 30 Reads

Variable importance is one of the most widely used measures for interpreting machine learning models, attracting significant interest from both the statistics and machine learning communities. Recently, increasing attention has been directed toward uncertainty quantification in these metrics. Current approaches largely rely on one-step procedures, which, while asymptotically efficient, can exhibit greater sensitivity and instability in finite-sample settings. To address these limitations, we propose a novel method employing the targeted learning (TL) framework, designed to enhance robustness in inference for variable importance metrics. Our approach is particularly suited to conditional permutation variable importance. We show that it (i) retains the asymptotic efficiency of traditional methods, (ii) maintains comparable computational complexity, and (iii) delivers improved accuracy, especially in finite-sample contexts. We further support these findings with numerical experiments that illustrate the practical advantages of our method and validate the theoretical results.
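As a concrete point of comparison, a plug-in (non-targeted) version of conditional permutation importance can be computed by permuting a feature within strata of a covariate it depends on and recording the increase in loss. The sketch below uses a made-up data-generating process and a crude binning approximation to the conditional distribution; it illustrates the estimand only and carries no uncertainty quantification, which is exactly what the one-step and TL procedures discussed above supply.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)
n = 2000
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + np.sqrt(1 - 0.81) * rng.normal(size=n)   # x2 is correlated with x1
y = 2.0 * x1 + rng.normal(size=n)                         # only x1 drives the response
X = np.column_stack([x1, x2])

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
base = mean_squared_error(y, model.predict(X))            # in practice, use held-out data

def conditional_permutation_vi(model, X, y, j, cond, n_bins=10):
    """Permute column j within bins of a conditioning covariate to respect dependence."""
    cuts = np.quantile(X[:, cond], np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.digitize(X[:, cond], cuts)
    Xp = X.copy()
    for b in np.unique(bins):
        idx = np.where(bins == b)[0]
        Xp[idx, j] = X[rng.permutation(idx), j]
    return mean_squared_error(y, model.predict(Xp)) - base

print("conditional VI of x2 given x1:", conditional_permutation_vi(model, X, y, j=1, cond=0))
```

Because x2 is uninformative once x1 is accounted for, its conditional importance should be near zero even though a marginal permutation would inflate it.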


It's about (taking up) space: Discreteness of individuals and the strength of spatial coexistence mechanisms

October 2024 · 144 Reads

Stephen P. Ellner · Robin E. Snyder · [...] · Giles Hooker

One strand of modern coexistence theory (MCT) partitions invader growth rates (IGR) to quantify how different mechanisms contribute to species coexistence, highlighting fluctuation‐dependent mechanisms. A general conclusion from the classical analytic MCT theory is that coexistence mechanisms relying on temporal variation (such as the temporal storage effect) are generally less effective at promoting coexistence than mechanisms relying on spatial or spatiotemporal variation (primarily growth‐density covariance). However, the analytic theory assumes continuous population density, and IGRs are calculated for infinitesimally rare invaders that have infinite time to find their preferred habitat and regrow, without ever experiencing intraspecific competition. Here we ask if the disparity between spatial and temporal mechanisms persists when individuals are, instead, discrete and occupy finite amounts of space. We present a simulation‐based approach to quantifying IGRs in this situation, building on our previous approach for spatially non‐varying habitats. As expected, we found that spatial mechanisms are weakened; unexpectedly, the contribution to IGR from growth‐density covariance could even become negative, opposing coexistence. We also found shifts in which demographic parameters had the largest effect on the strength of spatial coexistence mechanisms. Our substantive conclusions are statements about one model, across parameter ranges that we subjectively considered realistic. Using the methods developed here, effects of individual discreteness should be explored theoretically across a broader range of conditions, and in models parameterized from empirical data on real communities.


Distribution of sightings count by species.
Variable Importance Measures for Multivariate Random Forests

September 2024 · 59 Reads

Journal of Data Science: JDS

Multivariate random forests (or MVRFs) are an extension of tree-based ensembles to examine multivariate responses. MVRF can be particularly helpful where some of the responses exhibit sparse (e.g., zero-inflated) distributions, making borrowing strength from correlated features attractive. Tree-based algorithms select features using variable importance measures (VIMs) that score each covariate based on the strength of dependence of the model on that variable. In this paper, we develop and propose new VIMs for MVRFs. Specifically, we focus on the variable’s ability to achieve split improvement, i.e., the difference in the responses between the left and right nodes obtained after splitting the parent node, for a multivariate response. Our proposed VIMs are an improvement over the default naïve VIM in existing software and allow us to investigate the strength of dependence both globally and on a per-response basis. Our simulation studies show that our proposed VIM recovers the true predictors better than naïve measures. We demonstrate usage of the VIMs for variable selection in two empirical applications; the first is on Amazon Marketplace data to predict Buy Box prices of multiple brands in a category, and the second is on ecology data to predict co-occurrence of multiple, rare bird species. A feature of both data sets is that some outcomes are sparse — exhibiting a substantial proportion of zeros or fixed values. In both cases, the proposed VIMs when used for variable screening give superior predictive accuracy over naïve measures.
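The proposed VIMs are built from split improvement inside the MVRF itself; the sketch below is not that measure but a simpler permutation-based analogue, included only to illustrate what a per-response importance looks like for a multi-output forest (the data-generating process and all settings here are invented, and the zero-inflated second response mirrors the sparsity discussed above).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
n, p = 1500, 6
X = rng.normal(size=(n, p))
# Two responses: both depend mainly on X[:, 0]; the second is zero-inflated (sparse).
y1 = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n)
y2 = rng.binomial(1, 0.2, size=n) * (X[:, 0] + rng.normal(scale=0.5, size=n))
Y = np.column_stack([y1, y2])

forest = RandomForestRegressor(n_estimators=300, random_state=0).fit(X, Y)

def per_response_permutation_vi(model, X, Y, j, n_repeats=5):
    """Mean increase in per-response MSE when column j is permuted."""
    base = np.mean((Y - model.predict(X)) ** 2, axis=0)
    deltas = []
    for _ in range(n_repeats):
        Xp = X.copy()
        Xp[:, j] = rng.permutation(Xp[:, j])
        deltas.append(np.mean((Y - model.predict(Xp)) ** 2, axis=0) - base)
    return np.mean(deltas, axis=0)

for j in range(p):
    print(f"feature {j}: per-response importance = {per_response_permutation_vi(forest, X, Y, j)}")
```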


A generic approach for reproducible model distillation

August 2024 · 24 Reads · 1 Citation

Machine Learning

Model distillation has been a popular method for producing interpretable machine learning. It uses an interpretable “student” model to mimic the predictions made by a black-box “teacher” model. However, when the student model is sensitive to the variability of the data sets used for training, even with the teacher held fixed, the corresponding interpretation is not reliable. Existing strategies stabilize model distillation by checking whether a large enough sample of pseudo-data has been generated to reliably reproduce student models, but such methods have so far been developed separately for each specific class of student model. In this paper, we develop a generic approach to stable model distillation based on a central limit theorem for the estimated fidelity of the student to the teacher. We start with a collection of candidate student models and search for candidates that reasonably agree with the teacher. We then construct a multiple-testing framework to select a sample size such that a consistent student model would be selected under different pseudo-samples. We demonstrate the application of our proposed approach on three commonly used intelligible models: decision trees, falling rule lists, and symbolic regression. Finally, we conduct simulation experiments on the Mammographic Mass and Breast Cancer datasets and illustrate the testing procedure through a theoretical analysis based on a Markov process. The code is publicly available at https://github.com/yunzhe-zhou/GenericDistillation.
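The core quantity in the procedure is the student's fidelity to the teacher on pseudo-data, whose sampling variability is controlled via a central limit theorem. The sketch below (the teacher, students, pseudo-data generator, and all settings are placeholders, and it omits the paper's multiple-testing construction) shows how a CLT standard error on fidelity indicates whether the pseudo-sample is large enough to distinguish candidate students reproducibly.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(4)
X = rng.normal(size=(3000, 5))
y = (X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.5, size=3000) > 1).astype(int)
teacher = GradientBoostingClassifier(random_state=0).fit(X, y)   # black-box teacher (toy)

def fidelity_with_se(student, X_pseudo):
    """Agreement rate between student and teacher on pseudo-data, with a CLT standard error."""
    t = teacher.predict(X_pseudo)
    s = student.fit(X_pseudo, t).predict(X_pseudo)
    agree = (s == t).astype(float)
    return agree.mean(), agree.std(ddof=1) / np.sqrt(len(agree))

for m in (200, 2000, 20000):
    X_pseudo = rng.normal(size=(m, 5))   # pseudo-data from a simple generator (assumption)
    for name, student in [("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
                          ("logit", LogisticRegression(max_iter=1000))]:
        fid, se = fidelity_with_se(student, X_pseudo)
        print(f"m={m:6d}  {name:5s}  fidelity={fid:.3f} ± {1.96*se:.3f}")
# When the candidates' fidelity intervals overlap, the pseudo-sample is too small to pick
# a student reproducibly; the paper formalizes this sample-size choice as a multiple-testing problem.
```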




Citations (46)


... Consequently, the explanations provided by LIME are sensitive to changes in the number of perturbed samples (Bansal et al., 2020). Similarly, Monte Carlo integration methods are often used to approximate Shapley values, which are also subject to sampling variability (Goldwasser & Hooker, 2023; Štrumbelj & Kononenko, 2013). ...

Reference:

How Interpretable Machine Learning Can Benefit Process Understanding in the Geosciences
Stabilizing Estimates of Shapley Values with Control Variates
  • Citing Chapter
  • July 2024

... There are many traditional scenarios where high precision is required as well, for example, in pharmacometrics it is a requirement during inference to solve the model to a much lower tolerance than what's generally used in simulation (tolerances of around 1e-10) in order to improve the robustness of the fitting schemes [3]. This is due to the fact that calculating derivatives of ODE solutions incurs error from the forward pass and induced solver error from the extended equations for the forward/adjoint pass, making higher forward simulation accuracy a requirement in order to achieve accurate gradients and thus a stable optimization scheme [4] [5]. ...

Differentiable Programming for Differential Equations: A Review

... The model probabilities of survival and transition represent averages within each stage, allowing us to calculate an annual projection matrix when the stable seasonal cycle is reached, using any month as the annual census time, as described below. This enables calculation of a range of life history and demographic outputs such as the stable stage structure and reproductive values, generation time, mean and variance of (remaining) lifespan, and the mean and variance of (remaining) lifetime reproduction, using matrix population model methods (Caswell, 2001, 2009; Hernández et al., 2024). Other models such as the model by Dobson et al. (2011) incorporate some transitions (such as molting) as fixed individual-based processes simulated outside of the projection matrix. ...

The natural history of luck: A synthesis study of structured population models

Ecology Letters

... Many post-hoc methods have been proposed in the literature, including counterfactual explanations [4,29,71], saliency maps [27,49,53,67], and model distillation [22,58]. Closer to our work, a few attempts have been made to explain a ViT architecture via post-hoc algorithms. ...

Considerations when learning additive explanations for black-box models

Machine Learning

... Several R packages support MPM analysis [e.g. popbio (Stubben & Milligan, 2007), popdemo (Stott et al., 2012), Rage (Jones et al., 2022), exactLTRE (Hernández et al., 2023)], but none provide a broad scope for simulating MPMs with specific characteristics. ...

An exact version of Life Table Response Experiment analysis, and the R package exactLTRE

... Inspired by stable approximation trees in model distillation [26], Zhou et al. [27] introduced S-LIME, an approach that tackles the challenge of stability in LIME and similar techniques. S-LIME leverages a hypothesis testing framework based on the central limit theorem to determine the optimal number of perturbed samples required to generate more stable explanations than traditional LIME. ...

Approximation trees: statistical reproducibility in model distillation

Data Mining and Knowledge Discovery

... Recently, a number of multivariate GBM models have been proposed, such as Delta Boosting (Lee and Lin 2018), NGBoost (Duan et al. 2020) and Cyclic Gradient Boosting (CGBM) (Delong et al. 2023). In Zhou and Hooker (2022), a GBM-based VCM is introduced where univariate GBMs are fitted to the partial derivatives simultaneously in each boosting iteration. Other examples of recent work on tree-based VCMs that consider e.g. ...

Decision tree boosted varying coefficient models

Data Mining and Knowledge Discovery

... Similarly, quantitative genetics has successfully partitioned the additive heritable phenotypic variation into genetic and environmental components without knowing the specific loci responsible for this variation [66]. The study of competition and species coexistence now routinely uses a partitioning scheme that envisions counterfactual scenarios (e.g. a system in which no stabilizing mechanisms are present) to measure the impact of species differences in maintaining or eroding species diversity [67,68]. Variance partitioning schemes for population models can play a similar role, helping us to categorize and quantify the qualitative components of a dynamical system and improve our understanding of the mechanisms that drive fluctuations in population size over time. ...

Toward a “modern coexistence theory” for the discrete and spatial

... Functional data analysis is collected in monographs of [37], [8] and [20]. The ability to consider derivatives, a by-product of conceiving the data as functions, is an advantage for visualisation [39] and modelling [17]. It also gives rise to dynamic data analysis in [36] and [16]. ...

Selecting the derivative of a functional covariate in scalar-on-function regression

Statistics and Computing

... IPMs are similar to age- or stage-based models but allow the use of a continuous variable such as size, rather than discrete categories of age-class or stage. A strength of IPMs is that we can use relatively intuitive regression models to describe how size influences demographic rates (Ellner et al. 2016; Ellner et al. 2022). Lastly, we estimated lambda with and without the costs of reproduction. ...

A critical comparison of integral projection and matrix projection models for demographic analysis: Comment