Michael Cogswell’s research while affiliated with SRI International and other places


Publications (27)


A Video is Worth 10,000 Words: Training and Benchmarking with Diverse Captions for Better Long Video Retrieval
  • Conference Paper

February 2025 · 1 Read

Matthew Gwilliam · Michael Cogswell · Meng Ye · [...]
Figure 1: Two examples from CURE. Besides the high-level inference about the image (e.g., "The girl is turning two years old today."), each example also contains CoT reasoning chains to evaluate VLMs' reasoning performance and consistency. Only 2 of the 6 candidate options are shown for presentation; more examples are shown in Figure 9.
Figure 3: The word cloud of the visual clues.
Figure 4: Question distribution.
Figure 11: The prompt used to guide LLMs in generating candidate answers for the CoT sub-questions. "Human-Annotated Visual Clue" is the human annotation result in the original Sherlock dataset.
Figure 12: The prompt used to guide LLMs in filtering inconsistent reasoning chains. "Human-Annotated Visual Clue" and "Human-Annotated High-Level Inference" are human annotation results in the original Sherlock dataset.


Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models
  • Preprint
  • File available

September 2023 · 82 Reads

Vision-language models (VLMs) have recently demonstrated strong efficacy as visual assistants that can parse natural queries about the visual content and generate human-like outputs. In this work, we explore the ability of these models to demonstrate human-like reasoning based on the perceived information. To address a crucial concern regarding the extent to which their reasoning capabilities are fully consistent and grounded, we also measure the reasoning consistency of these models. We achieve this by proposing a chain-of-thought (CoT) based consistency measure. However, such an evaluation requires a benchmark that encompasses both high-level inference and detailed reasoning chains, which is costly. We tackle this challenge by proposing an LLM-Human-in-the-Loop pipeline, which notably reduces cost while simultaneously ensuring the generation of a high-quality dataset. Based on this pipeline and the existing coarse-grained annotated dataset, we build the CURE benchmark to measure both the zero-shot reasoning performance and consistency of VLMs. We evaluate existing state-of-the-art VLMs, and find that even the best-performing model is unable to demonstrate strong visual reasoning capabilities and consistency, indicating that substantial efforts are required to enable VLMs to perform visual reasoning as systematically and consistently as humans. As an early step, we propose a two-stage training framework aimed at improving both the reasoning performance and consistency of VLMs. The first stage involves employing supervised fine-tuning of VLMs using step-by-step reasoning samples automatically generated by LLMs. In the second stage, we further augment the training process by incorporating feedback provided by LLMs to produce reasoning chains that are highly consistent and grounded. We empirically highlight the effectiveness of our framework in both reasoning performance and consistency.
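
The abstract describes a chain-of-thought (CoT) based consistency measure. As a rough illustration only, not the paper's exact metric, the sketch below assumes per-example correctness labels for the high-level answer and each CoT sub-question are already available, and computes how often a correct final answer is backed by a fully correct reasoning chain.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Example:
    final_correct: bool             # did the VLM get the high-level inference right?
    subanswers_correct: List[bool]  # correctness of each CoT sub-question answer

def cot_consistency(examples: List[Example]) -> float:
    """Toy consistency score: among examples where the model answers the
    high-level question correctly, the fraction whose CoT sub-questions are
    also all answered correctly. Illustrative stand-in, not the paper's metric."""
    correct = [e for e in examples if e.final_correct]
    if not correct:
        return 0.0
    consistent = sum(all(e.subanswers_correct) for e in correct)
    return consistent / len(correct)

if __name__ == "__main__":
    data = [
        Example(True,  [True, True]),   # right answer, grounded reasoning
        Example(True,  [False, True]),  # right answer, inconsistent chain
        Example(False, [True, True]),   # wrong answer, ignored by this score
    ]
    print(cot_consistency(data))  # 0.5
```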


Figure 1: This benchmark presents three datasets to evaluate V+L models on relational, compositional and contextual understanding. They use image-text matching tasks with predicate, subject/object, composition, or background swaps. In these examples, CLIP succeeds at contextual understanding but not at relational or compositional understanding.
Probing Conceptual Understanding of Large Visual-Language Models

April 2023 · 34 Reads

We present a novel framework for probing and improving relational, compositional and contextual understanding of large visual-language models (V+L). While large V+L models have achieved success in various downstream tasks, it is not clear if they have a conceptual grasp of the content. We propose a novel benchmarking dataset for probing three aspects of content understanding. Our probes are grounded in cognitive science and help determine, for example, whether a V+L model can recognize that snow garnished with a man is implausible, or whether it can identify beach furniture by knowing it is located on a beach. We have experimented with five well-known models, such as CLIP and ViLT, and found that they mostly fail to demonstrate a conceptual understanding. That said, we find interesting insights, such as that cross-attention helps in learning conceptual understanding. We use these insights to propose a new finetuning technique that rewards the three conceptual understanding measures we proposed. We hope that the presented benchmarks will help the community assess and improve the conceptual understanding capabilities of large V+L models.
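
A probe of this kind can be run with off-the-shelf CLIP via image-text matching. The sketch below is illustrative only: the image path and the swapped caption pair are made-up placeholders, and the benchmark's actual predicate/subject-object/background swaps are constructed differently.

```python
import torch
import clip                      # https://github.com/openai/CLIP
from PIL import Image

# Relational probe in the spirit of the benchmark: the model should prefer the
# plausible caption over one with the relation swapped.
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("example_scene.jpg")).unsqueeze(0).to(device)  # placeholder file
captions = [
    "a man garnished with snow",   # original relation
    "snow garnished with a man",   # swapped (implausible) relation
]
text = clip.tokenize(captions).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1).squeeze(0)

# A conceptually grounded model should put most of the mass on the plausible caption.
print({c: round(p.item(), 3) for c, p in zip(captions, probs)})
```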


Generating and Evaluating Explanations of Attended and Error‐Inducing Input Regions for VQA Models

November 2021 · 32 Reads · 7 Citations

Attention maps, a popular heatmap-based explanation method for Visual Question Answering (VQA), are supposed to help users understand the model by highlighting portions of the image/question used by the model to infer answers. However, we see that users are often misled by current attention map visualizations that point to relevant regions despite the model producing an incorrect answer. Hence, we propose Error Maps that clarify the error by highlighting image regions where the model is prone to err. Error maps can indicate when a correctly attended region may be processed incorrectly, leading to an incorrect answer, and hence improve users' understanding of those cases. To evaluate our new explanations, we further introduce a metric that simulates users' interpretation of explanations to evaluate their potential helpfulness in understanding model correctness. Finally, we conduct user studies and find that our new explanations help users understand model correctness better than baselines by an expected 30%, and that our proxy helpfulness metrics correlate strongly (ρ > 0.97) with how well users can predict model correctness.
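
To make the proxy helpfulness idea concrete, the sketch below simulates a user who predicts the model is incorrect whenever a large share of the attention mass falls on error-prone regions, and then scores how often that prediction matches the model's actual correctness. This is a hypothetical illustration, not the paper's metric; the threshold and the random maps are placeholders.

```python
import numpy as np

def simulated_user_prediction(attention_map: np.ndarray,
                              error_map: np.ndarray,
                              threshold: float = 0.3) -> bool:
    """Hypothetical simulated user: predict that the model is *incorrect* when
    a large fraction of attention mass falls on error-prone regions."""
    attention = attention_map / (attention_map.sum() + 1e-8)
    mass_on_errors = float((attention * error_map).sum())
    return mass_on_errors < threshold       # True -> "model is probably correct"

def proxy_helpfulness(attention_maps, error_maps, model_correct) -> float:
    """Fraction of examples where the simulated user's correctness prediction
    matches the model's actual correctness."""
    preds = [simulated_user_prediction(a, e) for a, e in zip(attention_maps, error_maps)]
    return float(np.mean([p == c for p, c in zip(preds, model_correct)]))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    atts = [rng.random((14, 14)) for _ in range(4)]
    errs = [(rng.random((14, 14)) > 0.7).astype(float) for _ in range(4)]
    print(proxy_helpfulness(atts, errs, [True, False, True, False]))
```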


While alternative real images may present a convincing counterfactual case for a visual question answering (VQA) model, they are expensive to harvest and also often incapable of selecting specific features. In this sample, while the real‐image counterfactual may suggest that the AI agent is correctly capturing the type of sport, the in‐painted counterfactual suggests that the change in the answer is not necessarily correlated to the changes in the input
Generating counterfactual images based on human-annotated attention. The algorithm first identifies the most attended and least attended bounding boxes in the image and then applies a generative adversarial network (GAN) to in-paint those bounding boxes and produce the counterfactual images
The interfaces for the experiments that evaluate the impact of in-painted counterfactuals for the task of answer-change prediction. Users in both groups are evaluated based on the same in-painting patterns. While the users in the Counterfactual groups can use the counterfactual samples in their prediction, the baseline group attempts to predict the answer change based only on the original image-question (IQ) response. For the input and sample images, users see the AI's top answer along with its probability (blue bar beneath the answers)
The workflow for different groups of the study. While steps 1 and 3 are shared among groups, the explanation step differentiates between them. In in‐painted counterfactuals, samples 1 and 2 are in‐painted over the least attended and most attended areas, respectively. The real counterfactual images are sampled from the VQA data set
Improving Users’ Mental Model with Attention‐directed Counterfactual Edits

November 2021 · 37 Reads · 14 Citations

In the domain of Visual Question Answering (VQA), studies have shown improvement in users' mental model of the VQA system when they are exposed to examples of how these systems answer certain Image-Question (IQ) pairs¹. In this work, we show that presenting controlled counterfactual image-question examples is more effective at improving users' mental model than simply showing random examples. We compare a generative approach and a retrieval-based approach for showing counterfactual examples. We use recent advances in generative adversarial networks (GANs) to generate counterfactual images by deleting and inpainting certain regions of interest in the image. We then expose users to changes in the VQA system's answer on those altered images. To select the region of interest for inpainting, we experiment with using both human-annotated attention maps and a fully automatic method that uses the VQA system's attention values. Finally, we test the user's mental model by asking them to predict the model's performance on a test counterfactual image. We note an overall improvement in users' accuracy in predicting answer change when shown counterfactual explanations. While realistic retrieved counterfactuals are unsurprisingly the most effective at improving the mental model, we show that a generative approach can be equally effective.
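
As a rough sketch of the attention-directed editing described above: pick the most and least attended bounding boxes and in-paint them. The `inpaint_region` helper below is a hypothetical placeholder that simply blanks the region with the image mean; the paper uses GAN-based in-painting for this step.

```python
import numpy as np

def select_boxes(attention: np.ndarray):
    """Given per-bounding-box attention values (from the VQA model or human
    annotation), return indices of the most and least attended boxes."""
    return int(attention.argmax()), int(attention.argmin())

def inpaint_region(image: np.ndarray, box) -> np.ndarray:
    """Placeholder for a GAN-based in-painter: blanks the region with the
    image mean so the sketch stays self-contained."""
    x0, y0, x1, y1 = box
    edited = image.copy()
    edited[y0:y1, x0:x1] = image.mean(axis=(0, 1))
    return edited

def counterfactual_edits(image, boxes, attention):
    """Produce two counterfactuals: one with the least attended region removed
    (answer should be stable) and one with the most attended region removed
    (answer is likely to change)."""
    most, least = select_boxes(attention)
    return inpaint_region(image, boxes[least]), inpaint_region(image, boxes[most])

if __name__ == "__main__":
    img = np.random.rand(224, 224, 3)
    boxes = [(10, 10, 60, 60), (100, 100, 180, 180)]
    att = np.array([0.2, 0.8])
    stable, changed = counterfactual_edits(img, boxes, att)
    print(stable.shape, changed.shape)
```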


Figure 6: Examples of recovered triggers overlaid on clean images. From left to right: (a) clean image, (b) triggers recovered by (Wang et al., 2019), (c) triggers recovered by (Liu et al., 2019), (d) triggers recovered by (Guo et al., 2019), (e) triggers recovered by our method without topological prior, and (f) triggers recovered by our method with topological prior.
Figure 7: Ablation study results for λ₂.
Figure 8: Our Trojan detection method combines bottom-up trigger reverse engineering under topological constraints with top-down classification. Such a combination allows us to accurately isolate Trojan triggers from non-Trojan patterns such as adversarial noise and object modifications.
Figure 9: Reverse engineering of global color filter triggers.
Performance comparison on the TrojAI dataset.
Trigger Hunting with a Topological Prior for Trojan Detection

October 2021 · 49 Reads

Despite their success and popularity, deep neural networks (DNNs) are vulnerable when facing backdoor attacks. This impedes their wider adoption, especially in mission-critical applications. This paper tackles the problem of Trojan detection, namely, identifying Trojaned models -- models trained with poisoned data. One popular approach is reverse engineering, i.e., recovering the triggers on a clean image by manipulating the model's prediction. One major challenge of the reverse engineering approach is the enormous search space of triggers. To this end, we propose innovative priors such as diversity and topological simplicity to not only increase the chances of finding the appropriate triggers but also improve the quality of the found triggers. Moreover, by encouraging a diverse set of trigger candidates, our method can perform effectively in cases with unknown target labels. We demonstrate that these priors can significantly improve the quality of the recovered triggers, resulting in substantially improved Trojan detection accuracy as validated on both synthetic and publicly available TrojAI benchmarks.
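
The abstract centers on reverse engineering triggers under priors. The sketch below shows only the generic optimization loop (a mask and pattern stamped onto clean images to force a target label). The total-variation smoothness term is a simple stand-in for the paper's topological prior, the diversity prior over multiple candidates is omitted, and `model`, image shapes, and hyperparameters are placeholders.

```python
import torch
import torch.nn.functional as F

def reverse_engineer_trigger(model, clean_images, target_label,
                             steps=300, lr=0.1, l1_w=1e-3, smooth_w=1e-2):
    """Minimal sketch: optimize a mask and pattern so that stamping them onto
    clean images flips the prediction to `target_label`."""
    _, c, h, w = clean_images.shape
    mask_logit = torch.zeros(1, 1, h, w, requires_grad=True)
    pattern = torch.rand(1, c, h, w, requires_grad=True)
    opt = torch.optim.Adam([mask_logit, pattern], lr=lr)
    target = torch.full((clean_images.size(0),), target_label, dtype=torch.long)

    for _ in range(steps):
        mask = torch.sigmoid(mask_logit)
        stamped = (1 - mask) * clean_images + mask * pattern.clamp(0, 1)
        logits = model(stamped)
        # classification loss + small-mask prior + smoothness (stand-in for the topological prior)
        tv = (mask[..., 1:, :] - mask[..., :-1, :]).abs().mean() + \
             (mask[..., :, 1:] - mask[..., :, :-1]).abs().mean()
        loss = F.cross_entropy(logits, target) + l1_w * mask.abs().sum() + smooth_w * tv
        opt.zero_grad()
        loss.backward()
        opt.step()

    return torch.sigmoid(mask_logit).detach(), pattern.detach().clamp(0, 1)

if __name__ == "__main__":
    toy_model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
    images = torch.rand(8, 3, 32, 32)
    mask, pattern = reverse_engineer_trigger(toy_model, images, target_label=0, steps=20)
    print(mask.shape, pattern.shape)
```

A small, simple mask that reliably forces the target label is then the evidence used to flag a model as Trojaned.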


Citations (17)


... The rapid progress in Large Language Models (LLMs) has significantly boosted AI capabilities, particularly in text generation and complex reasoning [7,14,17,50,51,53,67,68,75]. Expanding these models into multimodal applications, Multimodal LLMs (MLLMs) have shown strong performance across image and short video tasks like captioning, question answering, and segmentation [10,11,16,34,35,37,57,58,62,72,93]. However, as video content lengthens, these models encounter significant challenges due to context length limitations, which restrict their ability to process multiple frames and capture complex, extended temporal interactions. ...

Reference:

HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding
Probing Conceptual Understanding of Large Visual-Language Models
  • Citing Conference Paper
  • June 2024

... To improve the ability of LVLMs to resist malicious queries, recent research has explored various safety alignment strategies [10], [11], [19]. A straightforward approach is to train LVLMs to refuse harmful queries using methods such as Reinforcement Learning from Human Feedback (RLHF) [12], [13], [32] or Supervised Fine-Tuning (SFT) [8], [33], [34], with the goal of reinforcing safe behaviors across diverse inputs. While such approaches can be effective, they require substantial human annotation and computational resources, which limits their scalability and adaptability. ...

DRESS : Instructing Large Vision-Language Models to Align and Interact with Humans via Natural Language Feedback
  • Citing Conference Paper
  • June 2024

... Reasoning in LVLMs: Inspired by the success of CoT prompting and training in LLMs, several works (Cheng et al., 2024; Chen et al., 2024b; Shen et al., 2025) have made progress in boosting LVLM performance by incorporating curated CoT data during training. Alibaba released QVQ (Qwen Team, 2024a), a reasoning LVLM along the lines of QwQ (Qwen Team, 2024a) and trained via an RL-based approach. ...

Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models
  • Citing Conference Paper
  • January 2024

... Diversity [20,21] and controllability [12,14] are two key attributes that have received widespread attention in previous IC research. Recent findings indicate that titles generated by supervised methods tend to be more generalized, capturing the most common language patterns and words from the training corpus, a phenomenon referred to as the pattern collapse issue. ...

Diverse Beam Search for Improved Description of Complex Scenes
  • Citing Article
  • April 2018

Proceedings of the AAAI Conference on Artificial Intelligence
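
The cited paper's core idea, decoding beams in groups and penalizing tokens already chosen by earlier groups at the same step, can be sketched compactly. The toy scorer below stands in for a real captioning model, and the Hamming-style penalty is a simplified illustration of the general idea rather than the paper's implementation.

```python
from collections import Counter

def toy_log_probs(prefix, vocab):
    # Arbitrary deterministic scores standing in for a real captioning model.
    prev = prefix[-1] if prefix else "<s>"
    return {w: -abs(len(prev) - len(w)) / 10.0 - 0.01 * len(w) for w in vocab}

def diverse_beam_search(vocab, beam_width=4, num_groups=2, steps=3, lam=0.8):
    """Group-wise beam search with a Hamming-style diversity penalty
    (a simplified sketch of the idea, not the paper's code)."""
    per_group = beam_width // num_groups
    groups = [[(["<s>"], 0.0)] for _ in range(num_groups)]   # each beam: (tokens, score)
    for _ in range(steps):
        chosen_at_t = Counter()            # tokens picked by earlier groups at this step
        for g in range(num_groups):
            candidates = []
            for tokens, score in groups[g]:
                for w, lp in toy_log_probs(tokens, vocab).items():
                    penalty = lam * chosen_at_t[w]   # discourage cross-group repeats
                    candidates.append((tokens + [w], score + lp - penalty))
            groups[g] = sorted(candidates, key=lambda c: c[1], reverse=True)[:per_group]
            for tokens, _ in groups[g]:
                chosen_at_t[tokens[-1]] += 1
    return [beam for group in groups for beam in group]

if __name__ == "__main__":
    vocab = ["a", "man", "rides", "horse", "on", "beach"]
    for tokens, score in diverse_beam_search(vocab):
        print(" ".join(tokens[1:]), round(score, 2))
```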

... Class activation mapping, such as Grad-CAM for CNNs, can show the importance of each region of the image related to the decision-making process of the model [28]. Others show error maps to highlight areas that might be erroneous [29] or use multiple maps for different VQA systems to generate a final heat map [30]. Furthermore, approaches have been designed to explicitly improve interpretability. ...

Generating and Evaluating Explanations of Attended and Error‐Inducing Input Regions for VQA Models

... This conjecture builds on two streams of research. One is research on AI-assisted decision making and problem solving, which recently started investigating how AI support can improve human learning, for example, in visual (e.g., Alipour et al., 2021; Goyal et al., 2019) or planning tasks. The other is decades-old research on multiple-cue learning (Harvey, 2012), which studies how and how well people make judgments and predictions about a target variable (e.g., a patient's future health outcome) based on multiple fallible cues (e.g., medical tests). ...

Improving Users’ Mental Model with Attention‐directed Counterfactual Edits

... For instance, Amazon [1] and Alibaba [2] use taxonomies in their e-commerce businesses to enhance the online shopping experience, while Pinterest utilizes taxonomies for content recommendation and advertisement targeting [3,4]. Moreover, taxonomies find their usage in MeSH [5], Wikidata [6], Bloom's Taxonomy [7], WordNet [8], and DBPedia [9], which are employed to enhance information retrieval systems, enabling more accurate and efficient access to relevant data and knowledge across a range of fields. ...

Comprehension Based Question Answering using Bloom’s Taxonomy
  • Citing Conference Paper
  • January 2021

... One of the issues is that users often have difficulty understanding the explanations. In particular, the generated heatmaps could be too coarse or erroneous due to low illumination or the complexity of the entity [14,23]. Another issue is that the way of presenting AI explanations could lead to cognitive bias in users, which leads to miscalibration of trust and degrades task performance [4,21,27]. ...

Knowing What VQA Does Not: Pointing to Error-Inducing Regions to Improve Explanation Helpfulness

... To enhance the interpretability of deep learning models, various methods have been developed. One such method is Grad-CAM [152], which highlights the gradients at a specific layer (feature map) of the network relative to the final output for a given input image. These techniques help identify the important regions of the input that most significantly contribute to the model's final decision. ...

Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization

International Journal of Computer Vision

... To better understand the decision mechanism of the model, this study employs the Gradient-weighted Class Activation Mapping (Grad-CAM) proposed by Selvaraju et al. [96]. Grad-CAM facilitates interpretability by generating heatmaps that highlight critical regions in the input image which most influence the model's predictions. ...

Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization
  • Citing Conference Paper
  • October 2017
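
Both excerpts above describe the same mechanism: Grad-CAM weights a convolutional layer's feature maps by the spatially averaged gradients of the class score, then applies a ReLU and upsamples. The sketch below is a minimal, generic PyTorch rendition of that recipe; the torchvision ResNet-18 and its layer4 block are used purely for illustration, and this is not any particular released implementation.

```python
import torch
import torch.nn.functional as F
from torchvision import models

def grad_cam(model, image, target_layer, class_idx=None):
    """Minimal Grad-CAM sketch: weight the target layer's feature maps by the
    spatially averaged gradients of the class score, ReLU, and normalize."""
    feats, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: feats.update(v=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(v=go[0]))

    logits = model(image)                        # image: (1, 3, H, W)
    if class_idx is None:
        class_idx = logits.argmax(dim=1).item()
    model.zero_grad()
    logits[0, class_idx].backward()

    h1.remove(); h2.remove()
    weights = grads["v"].mean(dim=(2, 3), keepdim=True)           # (1, C, 1, 1)
    cam = F.relu((weights * feats["v"]).sum(dim=1, keepdim=True)) # (1, 1, h, w)
    cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear", align_corners=False)
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)

if __name__ == "__main__":
    model = models.resnet18(weights=None).eval()
    heatmap = grad_cam(model, torch.randn(1, 3, 224, 224), model.layer4)
    print(heatmap.shape)  # torch.Size([1, 1, 224, 224])
```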