Walter Gerych’s research while affiliated with Massachusetts Institute of Technology and other places


Publications (49)


The Surprising Effectiveness of Infinite-Width NTKs for Characterizing and Improving Model Training
  • Article

April 2025 · 9 Reads · Proceedings of the AAAI Conference on Artificial Intelligence

Joshua DeOliveira · Walter Gerych · [...]

Developments in deep neural nets have trended towards increasingly large overparameterized architectures, resulting in lengthy training sessions with ever more elusive training dynamics. Ensuring that these models efficiently learn accurate, generalizable representations of data is thus challenging. Previous works have developed specialized techniques for data pruning, architecture selection, pseudo-label generation, bias identification, or label refurbishment to improve downstream training. Problematically, most methods require prohibitively expensive iterative model training. In this paper, we demonstrate that we can exploit recent neural tangent kernel (NTK) theory to understand and improve model training behavior before ever training a model. First, we show that a powerful signal derived from NTK theory can be computed remarkably fast. We then leverage this signal to design a unified suite of surprisingly effective tools for the four important tasks of architecture selection, pseudo-label verification, bias identification, and label refurbishment, all requiring zero model training.
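
The abstract does not spell out the paper's specific signal, but as an illustration of the zero-training premise, the sketch below (a minimal example, not the paper's method) computes a closed-form infinite-width NTK Gram matrix with the neural_tangents library; the architecture and data are placeholders.

```python
# Minimal sketch (not the paper's specific signal): a closed-form
# infinite-width NTK Gram matrix, computed with zero model training
# via the neural_tangents library. Architecture and data are placeholders.
import jax.numpy as jnp
from neural_tangents import stax

init_fn, apply_fn, kernel_fn = stax.serial(
    stax.Dense(512), stax.Relu(),
    stax.Dense(512), stax.Relu(),
    stax.Dense(10),
)

x = jnp.ones((64, 784))        # toy batch of flattened inputs
ntk = kernel_fn(x, x, 'ntk')   # 64 x 64 infinite-width NTK Gram matrix

# Kernel statistics (eigenspectrum, alignment with labels, etc.) can then
# be compared across candidate architectures before any gradient step.
spectrum = jnp.linalg.eigvalsh(ntk)
```

Because the kernel is available in closed form, statistics such as its eigenspectrum or its alignment with the label vector can be compared across candidate architectures without a single training step.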


Mapping from Meaning: Addressing the Miscalibration of Prompt-Sensitive Language Models
  • Article
  • Full-text available

April 2025 · 3 Reads · Proceedings of the AAAI Conference on Artificial Intelligence

Kyle Cox · Yikun Han · [...]

An interesting behavior in large language models (LLMs) is prompt sensitivity. When provided with different but semantically equivalent versions of the same prompt, models may produce very different distributions of answers. This suggests that the uncertainty reflected in a model's output distribution for one prompt may not reflect the model's uncertainty about the meaning of the prompt. We model prompt sensitivity as a type of generalization error, and show that sampling across the semantic concept space with paraphrasing perturbations improves uncertainty calibration without compromising accuracy. Additionally, we introduce a new metric for uncertainty decomposition in black-box LLMs that improves upon entropy-based decomposition by modeling semantic continuities in natural language generation. We show that this decomposition metric can be used to quantify how much LLM uncertainty is attributed to prompt sensitivity. Our work introduces a new way to improve uncertainty calibration in prompt-sensitive language models, and provides evidence that some LLMs fail to exhibit consistent general reasoning about the meanings of their inputs.
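
As a rough illustration of the sampling idea (not the paper's semantic decomposition metric), the sketch below pools answer samples across paraphrases of a prompt and measures the entropy of the pooled distribution; paraphrase and ask_llm are hypothetical stand-ins for a rewriting model and the LLM under test.

```python
# Sketch of paraphrase-marginalized uncertainty (helpers are hypothetical;
# plain entropy is used here, not the paper's decomposition metric).
import math
from collections import Counter
from typing import Callable

def pooled_answer_entropy(
    prompt: str,
    paraphrase: Callable[[str, int], list[str]],  # hypothetical rewriter
    ask_llm: Callable[[str, int], list[str]],     # hypothetical sampler
    k: int = 5,
    n: int = 10,
) -> float:
    """Sample answers across k semantically equivalent prompts and return
    the entropy of the pooled answer distribution. Pooling over paraphrases
    marginalizes out surface phrasing, so the remaining entropy better
    reflects uncertainty about the prompt's meaning."""
    answers: list[str] = []
    for p in [prompt] + paraphrase(prompt, k):
        answers.extend(ask_llm(p, n))
    counts = Counter(answers)
    total = sum(counts.values())
    return -sum(c / total * math.log(c / total) for c in counts.values())
```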


[Figures from this article. FIGURE 2: indoor (left) and outdoor (right) corrosion testing environments (images taken and released by Army Research Lab staff). FIGURE 3: general DA training architecture; each model differs in the transfer loss function used. FIGURE 5: corrosion rating scale, mapping millimeter measurement ranges to discrete ratings. FIGURE 7: test set images predicted incorrectly by the confidence-based mode ensemble (8 images) and the confidence-based median ensemble (6 images), annotated with ground truth (GT) and predicted (Pred) ratings.]
Learning to Adapt Deep Corrosion Assessment Models From Indoor to Outdoor Image Domains

January 2025 · 22 Reads · IEEE Access

Corrosion of materials impacts critical economic sectors from infrastructure to transportation. The development of safe, corrosion-inhibiting materials is thus an important area of study in materials science. Traditional corrosion science, preparing and monitoring materials under adverse conditions, is labor intensive and extremely costly. While deep learning has become popular for automating engineering tasks, the development of deep models for corrosion assessment is lacking. We study the unique problem of deep domain adaptation (DA) for automated corrosion assessment of corrosion-inhibiting materials. Corrosion data, i.e., photographic images of corroding materials, is abundant when produced in an artificially controlled laboratory, while images from natural outdoor environments are limited. We thus leverage the more readily available indoor data to train a corrosion assessment classifier and transfer it via domain adaptation. In doing so, we can perform well on the smaller, yet more realistic, outdoor corrosion data without requiring target labels. We empirically evaluate 5 popular DA models on real-world corrosion image data. Further, we design 8 strategies for ensembling these models. Across the evaluation metrics of accuracy, F1-score, and balanced accuracy, our study finds that ensembled models incorporating both predictions and confidence scores from each DA model outperform individual DA models, achieving a 41% relative improvement in test accuracy compared to a no-DA baseline. Additionally, we perform a failure analysis study of our model to explain its performance.
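
The figure captions above mention confidence-based mode and median ensembles. The sketch below is one plausible reading of those two strategies, assuming each DA model emits a discrete corrosion rating plus a confidence score per image; the paper's exact weighting scheme is not reproduced here.

```python
# Sketch of two confidence-based ensembles over K domain-adaptation models
# (one plausible reading of the "mode" and "median" ensembles; the exact
# weighting is an assumption). Each model yields a discrete corrosion
# rating and a confidence score for a given image.
import numpy as np

def confidence_mode(ratings: np.ndarray, confs: np.ndarray) -> int:
    """Pick the rating receiving the largest total confidence mass."""
    totals: dict[int, float] = {}
    for r, c in zip(ratings, confs):
        totals[int(r)] = totals.get(int(r), 0.0) + float(c)
    return max(totals, key=totals.get)

def confidence_median(ratings: np.ndarray, confs: np.ndarray) -> int:
    """Confidence-weighted median over the ordinal rating scale."""
    order = np.argsort(ratings)
    r, w = ratings[order], confs[order]
    cum = np.cumsum(w) / w.sum()
    return int(r[np.searchsorted(cum, 0.5)])

ratings = np.array([7, 8, 8, 6, 8])          # per-model predicted ratings
confs = np.array([0.6, 0.9, 0.7, 0.4, 0.8])  # per-model confidences
print(confidence_mode(ratings, confs), confidence_median(ratings, confs))
```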




MaskMedPaint: Masked Medical Image Inpainting with Diffusion Models for Mitigation of Spurious Correlations

November 2024 · 49 Reads

Spurious features associated with class labels can lead image classifiers to rely on shortcuts that don't generalize well to new domains. This is especially problematic in medical settings, where biased models fail when applied to different hospitals or systems. In such cases, data-driven methods to reduce spurious correlations are preferred, as clinicians can directly validate the modified images. While Denoising Diffusion Probabilistic Models (Diffusion Models) show promise for natural images, they are impractical for medical use due to the difficulty of describing spurious medical features. To address this, we propose Masked Medical Image Inpainting (MaskMedPaint), which uses text-to-image diffusion models to augment training images by inpainting areas outside key classification regions to match the target domain. We demonstrate that MaskMedPaint enhances generalization to target domains across both natural (Waterbirds, iWildCam) and medical (ISIC 2018, Chest X-ray) datasets, given limited unlabeled target images.
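
A minimal sketch of the core augmentation step follows, using the standard inpainting pipeline from Hugging Face diffusers. The checkpoint name, prompt, and mask files are assumptions; the paper's mask derivation and any target-domain adaptation of the diffusion model are omitted.

```python
# Sketch of the core augmentation step (assumed checkpoint and prompt):
# inpaint everything OUTSIDE the classification-relevant region so the
# background matches the target domain while the finding is preserved.
import torch
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting",  # assumed base checkpoint
    torch_dtype=torch.float16,
).to("cuda")

image = Image.open("train_example.png").convert("RGB").resize((512, 512))
# White = repaint, black = keep. This mask keeps the region of interest
# and repaints the rest (e.g., skin background around a lesion).
mask = Image.open("outside_roi_mask.png").convert("L").resize((512, 512))

augmented = pipe(
    prompt="a dermoscopic image from the target clinic",  # assumed prompt
    image=image,
    mask_image=mask,
).images[0]
augmented.save("augmented_example.png")
```

Because only the region outside the classification-relevant area is repainted, a clinician can visually confirm that the diagnostic content of each augmented image is unchanged.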


[Tables from this article: debiasing the UTKFACE and FAIRFACE datasets with respect to gender and race for STEREOTYPE queries; debiasing the CELEBA dataset with respect to gender for STEREOTYPE queries (race is not evaluated on CELEBA, which lacks race annotations); and debiasing FAIRFACE for HAIRCOLOR queries with respect to gender, evaluated on race.]
BendVLM: Test-Time Debiasing of Vision-Language Embeddings

November 2024 · 15 Reads

Vision-language model (VLM) embeddings have been shown to encode biases present in their training data, such as societal biases that prescribe negative characteristics to members of various racial and gender identities. VLMs are being quickly adopted for a variety of tasks ranging from few-shot classification to text-guided image generation, making debiasing VLM embeddings crucial. Debiasing approaches that fine-tune the VLM often suffer from catastrophic forgetting. On the other hand, fine-tuning-free methods typically utilize a "one-size-fits-all" approach that assumes correlation with the spurious attribute can be explained by a single linear direction across all possible inputs. In this work, we propose Bend-VLM, a nonlinear, fine-tuning-free approach to VLM embedding debiasing that tailors the debiasing operation to each unique input, allowing for a more flexible debiasing approach. Additionally, we do not require knowledge of the set of inputs prior to inference time, making our method more appropriate for online, open-set tasks such as retrieval and text-guided image generation.
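
For contrast, here is a minimal sketch of the "one-size-fits-all" linear-direction baseline the abstract argues against, in which a single attribute direction is estimated once and projected out of every embedding. Bend-VLM itself replaces this with a nonlinear, per-input operation that is not reproduced here; the group-embedding inputs are assumptions.

```python
# Sketch of the linear baseline the abstract critiques (NOT Bend-VLM):
# estimate one spurious-attribute direction and project it out of every
# embedding, regardless of input.
import numpy as np

def attribute_direction(emb_a: np.ndarray, emb_b: np.ndarray) -> np.ndarray:
    """Unit direction between mean embeddings of two attribute groups,
    e.g., CLIP embeddings of images annotated with each gender label."""
    d = emb_a.mean(axis=0) - emb_b.mean(axis=0)
    return d / np.linalg.norm(d)

def project_out(embeddings: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Remove the component along the attribute direction. The same linear
    operation is applied to every input, which is exactly the rigidity that
    an input-adaptive method like Bend-VLM is designed to avoid."""
    return embeddings - np.outer(embeddings @ direction, direction)
```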


Identifying Implicit Social Biases in Vision-Language Models

November 2024 · 8 Reads

Vision-language models, like CLIP (Contrastive Language Image Pretraining), are becoming increasingly popular for a wide range of multimodal retrieval tasks. However, prior work has shown that large language and deep vision models can learn historical biases contained in their training sets, leading to the perpetuation of stereotypes and potential downstream harm. In this work, we conduct a systematic analysis of the social biases present in CLIP, with a focus on the interaction between the image and text modalities. We first propose a taxonomy of social biases called So-B-IT, which contains 374 words categorized across ten types of bias; each type can lead to societal harm if associated with a particular demographic group. Using this taxonomy, we examine images retrieved by CLIP from a facial image dataset using each word as part of a prompt. We find that CLIP frequently displays undesirable associations between harmful words and specific demographic groups, such as retrieving mostly pictures of Middle Eastern men when asked to retrieve images of a "terrorist". Finally, we analyze the source of such biases by showing that the same harmful stereotypes we identify are also present in a large image-text dataset used to train CLIP models. Our findings highlight the importance of evaluating and addressing bias in vision-language models, and suggest the need for transparency and fairness-aware curation of large pre-training datasets.
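
A sketch of a retrieval probe in the spirit of this audit follows, using the Hugging Face transformers CLIP API. The face dataset, its demographic annotations, and the prompt wording are assumptions; the paper's taxonomy and evaluation protocol are not reproduced.

```python
# Sketch of a retrieval probe (assumed face dataset with demographic
# labels; prompt wording is illustrative).
from collections import Counter
import torch
from transformers import CLIPModel, CLIPProcessor

def audit_prompt(face_images, face_groups, text, k=100):
    """Retrieve the top-k faces for a text prompt and tally their
    demographic groups; a heavy skew for a balanced dataset signals a
    harmful association."""
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    with torch.no_grad():
        img = model.get_image_features(
            **processor(images=face_images, return_tensors="pt"))
        txt = model.get_text_features(
            **processor(text=[text], return_tensors="pt", padding=True))
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    top = (img @ txt.T).squeeze(1).topk(k).indices
    return Counter(face_groups[i] for i in top.tolist())

# e.g., audit_prompt(images, groups, "a photo of a terrorist")
```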


Identifying Implicit Social Biases in Vision-Language Models

October 2024 · 9 Reads · 8 Citations · Proceedings of the AAAI/ACM Conference on AI Ethics and Society

Vision-language models, like CLIP (Contrastive Language Image Pretraining), are becoming increasingly popular for a wide range of multimodal retrieval tasks. However, prior work has shown that large language and deep vision models can learn historical biases contained in their training sets, leading to the perpetuation of stereotypes and potential downstream harm. In this work, we conduct a systematic analysis of the social biases present in CLIP, with a focus on the interaction between the image and text modalities. We first propose a taxonomy of social biases called So-B-IT, which contains 374 words categorized across ten types of bias; each type can lead to societal harm if associated with a particular demographic group. Using this taxonomy, we examine images retrieved by CLIP from a facial image dataset using each word as part of a prompt. We find that CLIP frequently displays undesirable associations between harmful words and specific demographic groups, such as retrieving mostly pictures of Middle Eastern men when asked to retrieve images of a "terrorist". Finally, we analyze the source of such biases by showing that the same harmful stereotypes we identify are also present in a large image-text dataset used to train CLIP models. Our findings highlight the importance of evaluating and addressing bias in vision-language models, and suggest the need for transparency and fairness-aware curation of large pre-training datasets.


Amalgamating Multi-Task Models with Heterogeneous Architectures

March 2024 · 5 Reads · Proceedings of the AAAI Conference on Artificial Intelligence

Multi-task learning (MTL) is essential for real-world applications that handle multiple tasks simultaneously, such as self-driving cars. MTL methods improve the performance of all tasks by utilizing information across tasks to learn a robust shared representation. However, acquiring sufficient labeled data tends to be extremely expensive, especially when many tasks must be supported. Recently, Knowledge Amalgamation (KA) has emerged as an effective strategy for addressing the lack of labels by instead learning directly from pretrained models (teachers). KA learns one unified multi-task student that masters all tasks across all teachers. Existing KA works for MTL are limited to teachers with identical architectures, and thus propose layer-to-layer approaches. Unfortunately, in practice, teachers may have heterogeneous architectures: their layers may not be aligned, and their dimensionalities or scales may be incompatible. Amalgamating multi-task teachers with heterogeneous architectures remains an open problem. To this end, we design the Versatile Common Feature Consolidator (VENUS), the first solution to this problem. VENUS fuses knowledge from the shared representations of each teacher into one unified, generalized representation for all tasks. Specifically, we design the Feature Consolidator network, which leverages an array of teacher-specific trainable adaptors. These adaptors enable the student to learn from multiple teachers even if they have incompatible learned representations. We demonstrate that VENUS outperforms five alternative methods on numerous benchmark datasets across a broad spectrum of experiments.
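
The sketch below illustrates the adaptor idea as the abstract describes it. The mapping direction, dimensions, and feature-matching loss are assumptions: per-teacher adaptors map the student's common representation into each teacher's feature space so a matching loss can be applied even when teacher architectures are incompatible.

```python
# Sketch of teacher-specific adaptors (assumed dimensions, mapping
# direction, and loss; not the paper's exact Feature Consolidator).
import torch
import torch.nn as nn

class FeatureConsolidatorSketch(nn.Module):
    def __init__(self, student_dim: int, teacher_dims: list[int]):
        super().__init__()
        # One small trainable adaptor per teacher, projecting the student's
        # common feature space into that teacher's (possibly different-sized)
        # representation space.
        self.adaptors = nn.ModuleList(
            nn.Sequential(nn.Linear(student_dim, d), nn.ReLU(), nn.Linear(d, d))
            for d in teacher_dims
        )

    def loss(self, student_feat: torch.Tensor,
             teacher_feats: list[torch.Tensor]) -> torch.Tensor:
        # Sum of per-teacher feature-matching losses; gradients flow back
        # through the adaptors into the shared student representation.
        return sum(
            nn.functional.mse_loss(adaptor(student_feat), tf)
            for adaptor, tf in zip(self.adaptors, teacher_feats)
        )
```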


Citations (25)


... While some techniques are available and effective, using balanced data or incorporating representative variables is often the most effective approach for mitigating bias while simultaneously maintaining model performance, and as such has been the focus of our review. [73][74][75] Despite the availability of these mitigation strategies, the primary challenge lies in raising awareness and ensuring their widespread adoption among researchers. To address this issue, the implementation of standardised guidelines is crucial. ...

Reference:

Sex bias consideration in healthcare machine-learning research: a systematic review in rheumatoid arthritis
A data-centric perspective to fair machine learning for healthcare
  • Citing Article
  • November 2024

Nature Reviews Methods Primers

... Various studies have documented specific biases in these models: Mandal et al. [Mandal et al.(2023)] found that vision transformers (ViTs) amplify gender biases more than convolutional neural networks (CNNs), while Fraser et al. [Fraser and Kiritchenko(2024)] observed gender and race biases across scenarios with synthetically generated images. Research on CLIP models has revealed multiple bias dimensions, including societal categories such as race, gender, and ethnicity [Hamidieh et al.(2024)], as well as cultural biases favoring Western norms [Ananthram et al.(2024)]. Additionally, studies have shown that different genders and ethnicities are associated with distinct sentiments in model outputs [Capitani et al.(2024)]. ...

Identifying Implicit Social Biases in Vision-Language Models
  • Citing Article
  • October 2024

Proceedings of the AAAI/ACM Conference on AI Ethics and Society

... Knowledge editing (Zhang et al., 2024b; Jiang et al., 2024a; Sun et al., 2024; Hsueh et al., 2024; Powell et al., 2024; Rozner et al., 2024; Wang et al., 2024f; Shi et al., 2024; Huang et al., 2024; Guo et al., 2024; Wang et al., 2025b; Feng et al., 2025; Yang et al., 2025) has emerged as a promising approach for updating models in an ever-changing world. Current knowledge editing methods typically follow one of several strategies: modifying the MLP components in earlier layers (Meng et al., 2022, 2023), enhancing the MLP in later layers (Hartvigsen et al., 2023), or retrieving relevant facts as prompts (Zhong et al., 2023). ...

TAXI: Evaluating Categorical Knowledge Editing for Language Models
  • Citing Conference Paper
  • January 2024

... It pushes the boundary of computational creativity, showcasing increasingly impressive examples each year [23]. While GANs have shown remarkable power and drawn wide interest, they are also limited by various challenges: difficulty in achieving stable GAN training [9,27], the vanishing gradient, and mode collapse [1]. Efforts from both the academic community and the commercial sector have been made to address these problems [8,16,25,34,36]. ...

Stabilizing Adversarial Training for Generative Networks
  • Citing Conference Paper
  • December 2023

... A study was conducted on capturing the interdependence of labels in multi-label classification, where an example can be assigned multiple labels simultaneously. This study also demonstrated that effectively managing the complexities associated with labels necessitates advanced techniques, particularly in cases where certain labels are scarce or require additional contextual information for accurate classification [35]. Nevertheless, employing techniques like synthetic data generation has been demonstrated to improve model performance when dealing with imbalanced and diverse labels. ...

Knowledge Amalgamation for Multi-Label Classification via Label Dependency Transfer
  • Citing Article
  • June 2023

Proceedings of the AAAI Conference on Artificial Intelligence

... The distribution-embedded deep neural network (DDNN) [76] is a state-of-the-art network featuring learning approaches for activity recognition. • Triple-DARE [77] is a neural network method that combines three unique loss functions to enhance intra-class compactness and inter-class separation within the embedding space of multi-labeled datasets. In this experiment, we slightly modify the model from lab-to-field transfer to cross-individual transfer to match the settings of the proposed model. ...

Domain Adaptation Methods for Lab-to-Field Human Context Recognition

... Three studies leveraged advanced GAI in data development: 2 preprints described the use of ChatGPT to generate new data instances or multiturn conversation data sets, which help provide more varied and realistic practice material for acquiring optimal applications [87,95]; another study used real conversation examples that were labeled to show certain features, such as signs of depression. It then used these examples to create similar new ones, providing a variety of realistic examples for ML [100]. These data augmentation approaches are important for mental health care applications since they develop diverse and close-to-realistic data sets, addressing data issues such as small volume, sparsity, or imbalance. ...

Text Generation to Aid Depression Detection: A Comparative Study of Conditional Sequence Generative Adversarial Networks
  • Citing Conference Paper
  • December 2022

... [131] use tabular GANs to generate synthetic samples for the minority class in imbalanced survival datasets. [132] proposed HAR-CTGAN to generate synthetic data to handle class imbalance in human activity recognition data. It focuses on synthesizing continuous features, such as real-number data recorded from various sensors. [133] used GANs to augment and synthesize data for balancing the cardiovascular disease prediction dataset. ...

HAR-CTGAN: A Mobile Sensor Data Generation Tool for Human Activity Recognition
  • Citing Conference Paper
  • December 2022

... The architectural design enables the generation of a range of visual reports that facilitate a deeper comprehension of learner behavior and performance across diverse digital platforms. In the context of digital phenotyping, a study employed a range of sophisticated visualization techniques to contextualize and interpret low-level sensor data collected from smartphones [14]. The visualizations permit analysts to explore and interpret intricate, context-rich datasets, thereby facilitating the discovery of behavioral patterns, or "phone-o-types", that can be predictive of health outcomes. ...

INPHOVIS: Interactive visual analytics for smartphone-based digital phenotyping
  • Citing Article
  • January 2023

Visual Informatics

... Early classification of time series [1,2,3,4] is a pivotal task, especially when sampling costs are high, e.g., in medical early diagnosis [5], autonomous driving [6], and action recognition [7]. In these applications, the early classifier seeks to optimize both speed and accuracy at the same time. ...

Stop&Hop: Early Classification of Irregular Time Series
  • Citing Article
  • October 2022