Walter Gerych’s research while affiliated with Massachusetts Institute of Technology and other places



Publications (46)


GAN Stabilization Under Practical Training Assumptions
  • Conference Paper

December 2024 · 7 Reads

Joshua DeOliveira · Walter Gerych
Figures (MaskMedPaint preprint, below):
Figure 2: MaskMedPaint image generation for the ISIC 2018 dermoscopic shift. All possible combinations of skin-lesion conditions and rulers in the generated augmentations.
Figure 3: MaskMedPaint image generation for the CXR dataset shift. Example of a source MIMIC-CXR image (left) augmented to NIH style with MaskMedPaint (middle); for reference, a CXR from NIH (right).
Figure 4: MaskMedPaint image generation for the Waterbirds shift. All possible combinations of bird species and backgrounds in the generated augmentations; landbirds on water and waterbirds on land are categories not present in the original source dataset and are counterfactuals generated by MaskMedPaint.
Figure 5: MaskMedPaint image generation for the iWildCam shifts. Example of an original source image (left) adapted to the style of the target domain through MaskMedPaint augmentation (middle); for reference, real target-domain images from the same class (right).
Figure 6: Number of real versus generated images added (Waterbirds). Overall test accuracy (left), source accuracy (middle), and target accuracy (right) when adding 10, 20, 50, 100, and 200 real images from the target distribution; dashed lines show the mean accuracy of adding 1000 (purple), 2500 (yellow), and 5000 (red) MaskMedPaint-generated images. CIs over 5 seeds.
(5 more figures not shown.)

MaskMedPaint: Masked Medical Image Inpainting with Diffusion Models for Mitigation of Spurious Correlations
  • Preprint
  • File available

November 2024 · 47 Reads

Spurious features associated with class labels can lead image classifiers to rely on shortcuts that don't generalize well to new domains. This is especially problematic in medical settings, where biased models fail when applied to different hospitals or systems. In such cases, data-driven methods to reduce spurious correlations are preferred, as clinicians can directly validate the modified images. While Denoising Diffusion Probabilistic Models (Diffusion Models) show promise for natural images, they are impractical for medical use due to the difficulty of describing spurious medical features. To address this, we propose Masked Medical Image Inpainting (MaskMedPaint), which uses text-to-image diffusion models to augment training images by inpainting areas outside key classification regions to match the target domain. We demonstrate that MaskMedPaint enhances generalization to target domains across both natural (Waterbirds, iWildCam) and medical (ISIC 2018, Chest X-ray) datasets, given limited unlabeled target images.
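As a concrete illustration of the idea in the abstract, here is a minimal sketch of mask-guided inpainting using Hugging Face diffusers' StableDiffusionInpaintPipeline. The checkpoint, prompt, mask source, and function name are illustrative assumptions, not the authors' exact setup.

```python
# Minimal sketch of MaskMedPaint's core idea (assumptions: diffusers'
# StableDiffusionInpaintPipeline as the text-to-image inpainter; the
# checkpoint, prompt, and mask are illustrative, not the paper's setup).
import numpy as np
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

def maskmedpaint_augment(image: Image.Image,
                         class_mask: np.ndarray,
                         target_prompt: str) -> Image.Image:
    """Inpaint everything OUTSIDE the class-relevant region toward the target domain.

    class_mask: boolean array, True where the classification-relevant region is.
    """
    # Invert the mask: the pipeline repaints white (255) pixels, so we
    # protect the lesion/pathology region and repaint only the background.
    inpaint_mask = Image.fromarray((~class_mask).astype(np.uint8) * 255)
    out = pipe(prompt=target_prompt,
               image=image.resize((512, 512)),
               mask_image=inpaint_mask.resize((512, 512))).images[0]
    return out

# Hypothetical usage: adapt a source dermoscopic image toward the target domain.
# aug = maskmedpaint_augment(img, lesion_mask, "a dermoscopic image, target-hospital style")
```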


Figures (Bend-VLM, below):
Debiasing the UTKFace dataset with respect to gender and race for STEREOTYPE queries.
Debiasing the FairFace dataset with respect to gender and race for STEREOTYPE queries.
Debiasing the CelebA dataset with respect to gender for STEREOTYPE queries. We do not evaluate race on CelebA, as this dataset lacks race annotations.
Debiasing FairFace on HAIRCOLOR queries with respect to gender, evaluated on race.
BendVLM: Test-Time Debiasing of Vision-Language Embeddings

November 2024 · 10 Reads

Walter Gerych · Haoran Zhang · Kimia Hamidieh · [...]

Vision-language model (VLM) embeddings have been shown to encode biases present in their training data, such as societal biases that ascribe negative characteristics to members of various racial and gender identities. VLMs are being quickly adopted for a variety of tasks ranging from few-shot classification to text-guided image generation, making debiasing VLM embeddings crucial. Debiasing approaches that fine-tune the VLM often suffer from catastrophic forgetting. On the other hand, fine-tuning-free methods typically take a "one-size-fits-all" approach, assuming that correlation with the spurious attribute can be explained by a single linear direction across all possible inputs. In this work, we propose Bend-VLM, a nonlinear, fine-tuning-free approach for VLM embedding debiasing that tailors the debiasing operation to each unique input. This allows for a more flexible debiasing approach. Additionally, we do not require knowledge of the set of inputs ahead of inference time, making our method more appropriate for online, open-set tasks such as retrieval and text-guided image generation.
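To make the contrast with "one-size-fits-all" linear debiasing concrete, here is a generic sketch of input-adaptive, fine-tuning-free embedding debiasing: a bias direction is estimated locally for each query and projected out. This is not Bend-VLM's actual algorithm; the local reweighting scheme, temperature, and names are assumptions for illustration.

```python
# Generic test-time embedding debiasing sketch (NOT Bend-VLM's exact method):
# estimate a bias direction per input and remove it by orthogonal projection.
import torch
import torch.nn.functional as F

def debias_embedding(query: torch.Tensor, attr_prototypes: torch.Tensor) -> torch.Tensor:
    """query: (d,) L2-normalized VLM embedding.
    attr_prototypes: (k, d) embeddings of attribute prompts, e.g.
    "a photo of a man", "a photo of a woman".
    """
    # Local weights: how strongly the query aligns with each attribute prompt.
    sims = F.softmax(query @ attr_prototypes.T / 0.07, dim=0)      # (k,)
    # Input-adaptive bias direction: deviation of the similarity-weighted
    # attribute mixture from the uniform mixture. Because the weights depend
    # on the query, the removed direction varies per input, unlike a single
    # global linear direction.
    local = sims @ attr_prototypes                                  # (d,)
    uniform = attr_prototypes.mean(dim=0)                           # (d,)
    direction = F.normalize(local - uniform, dim=0)
    debiased = query - (query @ direction) * direction              # projection
    return F.normalize(debiased, dim=0)
```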


Identifying Implicit Social Biases in Vision-Language Models

November 2024 · 8 Reads

Vision-language models, like CLIP (Contrastive Language-Image Pretraining), are becoming increasingly popular for a wide range of multimodal retrieval tasks. However, prior work has shown that large language and deep vision models can learn historical biases contained in their training sets, leading to perpetuation of stereotypes and potential downstream harm. In this work, we conduct a systematic analysis of the social biases present in CLIP, with a focus on the interaction between the image and text modalities. We first propose a taxonomy of social biases called So-B-IT, which contains 374 words categorized across ten types of bias. Each type can lead to societal harm if associated with a particular demographic group. Using this taxonomy, we examine images retrieved by CLIP from a facial image dataset using each word as part of a prompt. We find that CLIP frequently displays undesirable associations between harmful words and specific demographic groups, such as retrieving mostly pictures of Middle Eastern men when asked to retrieve images of a "terrorist". Finally, we analyze the source of such biases, showing that, for the examples of bias we find, the same harmful stereotypes are also present in a large image-text dataset used to train CLIP models. Our findings highlight the importance of evaluating and addressing bias in vision-language models, and suggest the need for transparency and fairness-aware curation of large pre-training datasets.
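The retrieval audit described above can be sketched in a few lines with Hugging Face transformers' CLIP: encode a prompt, rank images by similarity, and tally demographic annotations among the top retrievals. The dataset, group labels, and function name are placeholders, not the paper's exact protocol.

```python
# Sketch of a retrieval-based bias audit (assumptions: transformers' CLIP;
# `face_images` and `group_labels` stand in for an annotated face dataset).
import torch
from collections import Counter
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def audit_prompt(prompt: str, images, groups, top_k: int = 100) -> Counter:
    """Retrieve the top_k images for a prompt and tally demographic groups.
    images: list of PIL images; groups: demographic label per image."""
    inputs = processor(text=[prompt], images=images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        # Similarity of the single prompt to every image.
        logits = model(**inputs).logits_per_text[0]
    top = torch.topk(logits, k=min(top_k, len(images))).indices
    return Counter(groups[i] for i in top.tolist())

# e.g. audit_prompt("a photo of a terrorist", face_images, group_labels)
# A tally heavily skewed relative to the dataset's base rates flags an
# undesirable association of the kind reported in the abstract.
```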


Identifying Implicit Social Biases in Vision-Language Models

October 2024 · 9 Reads · 6 Citations

Vision-language models, like CLIP (Contrastive Language-Image Pretraining), are becoming increasingly popular for a wide range of multimodal retrieval tasks. However, prior work has shown that large language and deep vision models can learn historical biases contained in their training sets, leading to perpetuation of stereotypes and potential downstream harm. In this work, we conduct a systematic analysis of the social biases present in CLIP, with a focus on the interaction between the image and text modalities. We first propose a taxonomy of social biases called So-B-IT, which contains 374 words categorized across ten types of bias. Each type can lead to societal harm if associated with a particular demographic group. Using this taxonomy, we examine images retrieved by CLIP from a facial image dataset using each word as part of a prompt. We find that CLIP frequently displays undesirable associations between harmful words and specific demographic groups, such as retrieving mostly pictures of Middle Eastern men when asked to retrieve images of a "terrorist". Finally, we analyze the source of such biases, showing that, for the examples of bias we find, the same harmful stereotypes are also present in a large image-text dataset used to train CLIP models. Our findings highlight the importance of evaluating and addressing bias in vision-language models, and suggest the need for transparency and fairness-aware curation of large pre-training datasets.


Amalgamating Multi-Task Models with Heterogeneous Architectures

March 2024 · 4 Reads

Proceedings of the AAAI Conference on Artificial Intelligence

Multi-task learning (MTL) is essential for real-world applications that handle multiple tasks simultaneously, such as self-driving cars. MTL methods improve the performance of all tasks by utilizing information across tasks to learn a robust shared representation. However, acquiring sufficient labeled data tends to be extremely expensive, especially when many tasks must be supported. Recently, Knowledge Amalgamation (KA) has emerged as an effective strategy for addressing the lack of labels by instead learning directly from pretrained models (teachers). KA learns one unified multi-task student that masters all tasks across all teachers. Existing KA methods for MTL are limited to teachers with identical architectures, and thus propose layer-to-layer approaches. Unfortunately, in practice, teachers may have heterogeneous architectures; their layers may not be aligned and their dimensionalities or scales may be incompatible. Amalgamating multi-task teachers with heterogeneous architectures remains an open problem. For this, we design the Versatile Common Feature Consolidator (VENUS), the first solution to this problem. VENUS fuses knowledge from the shared representations of each teacher into one unified, generalized representation for all tasks. Specifically, we design the Feature Consolidator network, which leverages an array of teacher-specific trainable adaptors. These adaptors enable the student to learn from multiple teachers even if they have incompatible learned representations. We demonstrate that VENUS outperforms five alternative methods on numerous benchmark datasets across a broad spectrum of experiments.
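The teacher-specific adaptor idea lends itself to a short sketch: one shared student backbone, plus a trainable linear adaptor per teacher that maps the student's common feature space into each teacher's (possibly differently sized) representation space. The adaptor shape and matching loss below are illustrative assumptions; the paper's Feature Consolidator may differ in detail.

```python
# Sketch of per-teacher adaptors for amalgamating heterogeneous teachers
# (assumptions: linear adaptors and an MSE feature-matching loss; the
# paper's Feature Consolidator network may be more elaborate).
import torch
import torch.nn as nn

class FeatureConsolidatorStudent(nn.Module):
    """One shared backbone plus a trainable adaptor per teacher, so the
    student can match teachers whose representations have incompatible
    dimensionalities."""
    def __init__(self, backbone: nn.Module, student_dim: int,
                 teacher_dims: list[int]):
        super().__init__()
        self.backbone = backbone
        self.adaptors = nn.ModuleList(
            nn.Linear(student_dim, d) for d in teacher_dims
        )

    def forward(self, x: torch.Tensor) -> list[torch.Tensor]:
        shared = self.backbone(x)                       # unified representation
        return [adapt(shared) for adapt in self.adaptors]

def amalgamation_loss(student_feats, teacher_feats):
    # Match each adapted student feature to the corresponding frozen
    # teacher's feature; no task labels are required.
    return sum(nn.functional.mse_loss(s, t.detach())
               for s, t in zip(student_feats, teacher_feats))
```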


Who Knows the Answer? Finding the Best Model and Prompt for Each Query Using Confidence-Based Search

March 2024 · 10 Reads · 1 Citation

Proceedings of the AAAI Conference on Artificial Intelligence

There are increasingly many large language models (LLMs) available to the public. While these LLMs have exhibited impressive abilities on a variety of tasks, any individual LLM may do well on some tasks and worse on others. Additionally, the performance of these models is heavily dependent on the choice of prompt template. For instance, they exhibit sensitivity to the few-shot examples chosen and brittleness to the wording of instructions. Moreover, a prompt template that makes a model perform well for one input may not be the optimal template for another input. This necessitates an approach for adaptively selecting an LLM and prompt template pair for each input. Recent work has shown that the accuracy of an LLM's responses is correlated with the LLM's confidence in those responses. Thus, a natural choice for selecting which model and prompt template to use is to select the pair that is most confident in its response. However, existing confidence metrics are expensive to calculate, requiring multiple calls to each LLM and prompt pair. We thus propose an approach to predict the confidence of each pair using an auxiliary regression model that is inexpensive to run. Using this auxiliary model, we select the LLM and prompt template with the highest predicted confidence for a given input. Results on a range of benchmark datasets show that our confidence-based, instance-level prompt search method consistently improves the performance of LLMs.
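The selection loop the abstract describes can be sketched as follows: train one cheap regressor per (LLM, prompt-template) pair to predict that pair's confidence from input features, then route each query to the argmax pair. The featurization and regressor choice are illustrative assumptions, not the paper's auxiliary model.

```python
# Sketch of confidence-based instance-level routing (assumptions: the
# featurizer and GradientBoostingRegressor stand in for the paper's
# auxiliary regression model).
from itertools import product
from sklearn.ensemble import GradientBoostingRegressor

class ConfidenceRouter:
    """One inexpensive regressor per (LLM, template) pair predicts the pair's
    confidence on an input; at inference only the argmax pair is called."""
    def __init__(self, llm_names, templates, featurize):
        self.pairs = list(product(llm_names, templates))
        self.featurize = featurize                      # input -> feature vector
        self.regressors = {p: GradientBoostingRegressor() for p in self.pairs}

    def fit(self, inputs, confidences):
        # confidences[pair][i]: measured confidence of `pair` on inputs[i],
        # collected once offline (this is the expensive step amortized away).
        X = [self.featurize(x) for x in inputs]
        for pair in self.pairs:
            self.regressors[pair].fit(X, confidences[pair])

    def route(self, x):
        feats = [self.featurize(x)]
        return max(self.pairs,
                   key=lambda p: self.regressors[p].predict(feats)[0])

# llm, template = router.route(query)  # then issue a single call to that pair
```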





Citations (25)


... While some techniques are available and effective, using balanced data or incorporating representative variables is often the most effective approach for simultaneously maintaining model performance, and as such has been the focus of our review [73][74][75]. Despite the availability of these mitigation strategies, the primary challenge lies in raising awareness and ensuring their widespread adoption among researchers. To address this issue, the implementation of standardised guidelines is crucial. ...

Reference:

Sex bias consideration in healthcare machine-learning research: a systematic review in rheumatoid arthritis
A data-centric perspective to fair machine learning for healthcare
  • Citing Article
  • November 2024

Nature Reviews Methods Primers

... More recently, efforts have expanded to multimodal models and datasets, addressing biases in various language-vision tasks. These investigations have explored biases in embeddings [25], text-to-image (TTI) generation [5,11,18,23,52,62,64], image retrieval [61], image captioning [27,65], and visual question-answering models [1,28,44]. Despite these advances, research on intersectional biases in TTI models remains limited. ...

Identifying Implicit Social Biases in Vision-Language Models
  • Citing Article
  • October 2024

... Other work argues that this is a sufficient criterion for LLMs having their own beliefs (Hofweber et al., 2024). Importantly, this optimization pressure seems to shape model outputs to be more human-like in the sense that they comprise a somewhat coherent worldview, though LLM outputs are still much less coherent than humans' (Powell et al., 2024). Thus, an RLHF-trained LLM could possess beliefs of its own, making it an appropriate candidate for belief revision, but it is not known how much current truth-oriented finetuning processes shape LLMs to have their own beliefs rather than simulate beliefs of the authors of their pretraining data. ...

TAXI: Evaluating Categorical Knowledge Editing for Language Models
  • Citing Conference Paper
  • January 2024

... It pushes the boundary of computational creativity, showcasing increasingly impressive examples each year [23]. While GANs have shown remarkable power and attracted great interest, they are also limited by various challenges: difficulty in achieving stable training of the GAN [9,27], the vanishing gradient, and mode collapse [1]. Efforts from both the academic community and the commercial sector have been made to address these problems [8,16,25,34,36]. ...

Stabilizing Adversarial Training for Generative Networks
  • Citing Conference Paper
  • December 2023

... A study was conducted on capturing the interdependence of labels in multiple-label classification, where an example can be assigned multiple labels simultaneously. This study also demonstrated that effectively managing the complexities associated with labels necessitates the use of advanced techniques, particularly in cases where certain labels are limited or require additional contextual information for accurate classification [35]. Nevertheless, employing techniques like synthetic data generation has been demonstrated to improve the performance of the model when dealing with imbalanced and diverse labels. ...

Knowledge Amalgamation for Multi-Label Classification via Label Dependency Transfer
  • Citing Article
  • June 2023

Proceedings of the AAAI Conference on Artificial Intelligence

... The distribution-embedded deep neural network (DDNN) [76] is a state-of-the-art network featuring learning approaches for activity recognition. • Triple-DARE [77] is a neural network method that combines three unique loss functions to enhance intra-class compactness and inter-class separation within the embedding space of multi-labeled datasets. In this experiment, we slightly modify the model from lab-to-field transfer to cross-individual transfer to make the settings the same as the proposed model. ...

Domain Adaptation Methods for Lab-to-Field Human Context Recognition
Sensors

... Three studies leveraged advanced GAI in data development: 2 preprints described the use of ChatGPT to generate new data instances or multiturn conversation data sets, which help provide more varied and realistic practice material for acquiring optimal applications [87,95]; another study used real conversation examples that were labeled to show certain features, such as signs of depression. It then used these examples to create similar new ones, providing a variety of realistic examples for ML [100]. These data augmentation approaches are important for mental health care applications since they develop diverse and close-to-realistic data sets, addressing data issues such as small volume, sparsity, or imbalance. ...

Text Generation to Aid Depression Detection: A Comparative Study of Conditional Sequence Generative Adversarial Networks
  • Citing Conference Paper
  • December 2022

... [131] use tabular GANs to generate synthetic samples for the minority class in imbalanced survival datasets. [132] proposed HAR-CTGAN to generate synthetic data to handle class imbalance in human activity recognition data. It focuses on synthesizing continuous features, such as real-number data recorded from various sensors. [133] used GANs to augment and synthesize data for balancing the cardiovascular disease prediction dataset. ...

HAR-CTGAN: A Mobile Sensor Data Generation Tool for Human Activity Recognition
  • Citing Conference Paper
  • December 2022

... The architectural design enables the generation of a range of visual reports that facilitate a deeper comprehension of learner behavior and performance across diverse digital platforms. In the context of digital phenotyping, a study employed a range of sophisticated visualization techniques to contextualize and interpret low-level sensor data collected from smartphones [14]. The visualizations permit analysts to explore and interpret intricate, context-rich datasets, thereby facilitating the discovery of behavioral patterns, or "phone-o-types", that can be predictive of health outcomes. ...

INPHOVIS: Interactive visual analytics for smartphone-based digital phenotyping
  • Citing Article
  • January 2023

Visual Informatics

... Early classification of time series [1,2,3,4] is a pivotal algorithm, especially when sampling cost is high, e.g., medical early diagnosis [5], autonomous driving [6], and action recognition [7]. Under these applications, the early classifier seeks to optimize both speed and accuracy at the same time. ...

Stop&Hop: Early Classification of Irregular Time Series
  • Citing Article
  • October 2022