February 2025 · 1 Read
February 2025 · 1 Read
June 2024 · 3 Reads · 8 Citations
June 2024 · 4 Reads · 52 Citations
January 2024 · 1 Read
January 2024 · 2 Reads · 27 Citations
September 2023 · 82 Reads
Vision-language models (VLMs) have recently demonstrated strong efficacy as visual assistants that can parse natural queries about the visual content and generate human-like outputs. In this work, we explore the ability of these models to demonstrate human-like reasoning based on the perceived information. To address a crucial concern regarding the extent to which their reasoning capabilities are fully consistent and grounded, we also measure the reasoning consistency of these models. We achieve this by proposing a chain-of-thought (CoT) based consistency measure. However, such an evaluation requires a benchmark that encompasses both high-level inference and detailed reasoning chains, which is costly. We tackle this challenge by proposing a LLM-Human-in-the-Loop pipeline, which notably reduces cost while simultaneously ensuring the generation of a high-quality dataset. Based on this pipeline and the existing coarse-grained annotated dataset, we build the CURE benchmark to measure both the zero-shot reasoning performance and consistency of VLMs. We evaluate existing state-of-the-art VLMs, and find that even the best-performing model is unable to demonstrate strong visual reasoning capabilities and consistency, indicating that substantial efforts are required to enable VLMs to perform visual reasoning as systematically and consistently as humans. As an early step, we propose a two-stage training framework aimed at improving both the reasoning performance and consistency of VLMs. The first stage involves employing supervised fine-tuning of VLMs using step-by-step reasoning samples automatically generated by LLMs. In the second stage, we further augment the training process by incorporating feedback provided by LLMs to produce reasoning chains that are highly consistent and grounded. We empirically highlight the effectiveness of our framework in both reasoning performance and consistency.
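The CoT-based consistency measure above can be illustrated with a minimal sketch: query the model once for a direct answer and once after asking it to reason step by step, then report the agreement rate. The `answer_fn` interface and the prompt wording below are assumptions for illustration, not the CURE protocol itself.

```python
# Hedged sketch of a chain-of-thought (CoT) consistency check for a VLM.
# `answer_fn` is a hypothetical interface (image, prompt) -> str; the exact
# CURE measure may differ from this simplified agreement-rate formulation.
from typing import Callable, Iterable

def cot_consistency_rate(
    answer_fn: Callable[[str, str], str],
    samples: Iterable[dict],
) -> float:
    """Fraction of samples where the direct answer matches the answer
    produced after the model is asked to reason step by step."""
    agree, total = 0, 0
    for s in samples:  # each sample: {"image": path, "question": str}
        direct = answer_fn(s["image"], s["question"])
        cot = answer_fn(
            s["image"],
            s["question"] + " Let's think step by step, then give the final answer.",
        )
        agree += int(direct.strip().lower() == cot.strip().lower())
        total += 1
    return agree / max(total, 1)
```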
April 2023 · 34 Reads
We present a novel framework for probing and improving relational, compositional, and contextual understanding of large visual-language (V+L) models. While large V+L models have achieved success in various downstream tasks, it is not clear whether they have a conceptual grasp of the content. We propose a novel benchmarking dataset for probing three aspects of content understanding. Our probes are grounded in cognitive science and help determine whether a V+L model can, for example, recognize that snow garnished with a man is implausible, or identify beach furniture by knowing it is located on a beach. We experimented with five well-known models, such as CLIP and ViLT, and found that they mostly fail to demonstrate a conceptual understanding. That said, we find interesting insights, such as that cross-attention helps in learning conceptual understanding. We use these insights to propose a new finetuning technique that rewards the three conceptual understanding measures we proposed. We hope that the presented benchmarks will help the community assess and improve the conceptual understanding capabilities of large V+L models.
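A minimal sketch of a plausibility-style probe, assuming a generic image-text matching score `score_fn` (e.g., CLIP cosine similarity): the model passes a probe when it scores the plausible caption above the implausible one. The exact probes and scoring in the benchmark may differ.

```python
# Hedged sketch of a plausibility probe for an image-text model.
# `score_fn(image, text)` is an assumed matching-score interface; the
# benchmark's actual probes and scoring may differ from this sketch.
from typing import Callable, Iterable

def plausibility_probe_accuracy(
    score_fn: Callable[[str, str], float],
    probes: Iterable[dict],
) -> float:
    """A probe is passed when the plausible caption outscores the implausible
    one, e.g. 'a man garnished with snow' vs. 'snow garnished with a man'."""
    correct, total = 0, 0
    for p in probes:  # {"image": path, "plausible": str, "implausible": str}
        correct += int(score_fn(p["image"], p["plausible"]) >
                       score_fn(p["image"], p["implausible"]))
        total += 1
    return correct / max(total, 1)
```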
November 2021 · 32 Reads · 7 Citations
Attention maps, a popular heatmap-based explanation method for Visual Question Answering (VQA), are supposed to help users understand the model by highlighting the portions of the image/question used by the model to infer answers. However, we see that users are often misled by current attention map visualizations, which point to relevant regions even when the model produces an incorrect answer. Hence, we propose Error Maps, which clarify the error by highlighting image regions where the model is prone to err. Error maps can indicate when a correctly attended region may be processed incorrectly, leading to an incorrect answer, and hence improve users' understanding of those cases. To evaluate our new explanations, we further introduce a metric that simulates users' interpretation of explanations to evaluate their potential helpfulness in understanding model correctness. We finally conduct user studies and find that our new explanations help users understand model correctness better than baselines by an expected 30%, and that our proxy helpfulness metrics correlate strongly (ρ > 0.97) with how well users can predict model correctness.
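The reported correlation between the proxy helpfulness metric and users' ability to predict model correctness can be checked with a short sketch using Spearman's rank correlation; the condition scores below are made-up illustrative numbers, not results from the paper.

```python
# Hedged sketch of checking how well a proxy helpfulness metric tracks users'
# ability to predict model correctness; the abstract reports a rank correlation
# of rho > 0.97, so Spearman's rho is used here as the illustrative statistic.
from scipy.stats import spearmanr

def helpfulness_correlation(proxy_scores, user_accuracy):
    """proxy_scores[i]: metric value for explanation condition i;
    user_accuracy[i]: fraction of users who correctly predicted model
    correctness under that condition."""
    rho, p_value = spearmanr(proxy_scores, user_accuracy)
    return rho, p_value

# Illustrative (made-up) numbers for three explanation conditions.
print(helpfulness_correlation([0.42, 0.55, 0.71], [0.48, 0.60, 0.77]))
```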
November 2021 · 37 Reads · 14 Citations
In the domain of Visual Question Answering (VQA), studies have shown improvement in users' mental model of the VQA system when they are exposed to examples of how these systems answer certain Image-Question (IQ) pairs¹. In this work, we show that controlled counterfactual image-question examples are more effective at improving users' mental model than simply showing random examples. We compare a generative approach and a retrieval-based approach for presenting counterfactual examples. We use recent advances in generative adversarial networks (GANs) to generate counterfactual images by deleting and inpainting certain regions of interest in the image. We then expose users to changes in the VQA system's answer on those altered images. To select the region of interest for inpainting, we experiment with both human-annotated attention maps and a fully automatic method that uses the VQA system's attention values. Finally, we test the users' mental model by asking them to predict the model's performance on a test counterfactual image. We note an overall improvement in users' accuracy at predicting answer changes when shown counterfactual explanations. While realistic retrieved counterfactuals are unsurprisingly the most effective at improving the mental model, we show that a generative approach can be equally effective.
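A minimal sketch of the automatic variant: mask the region with the highest attention, inpaint it, and compare the VQA answer before and after. `attention`, `inpaint_region`, and `vqa_answer` are assumed interfaces; the paper's GAN-based inpainting and region selection are more involved.

```python
# Hedged sketch of the counterfactual-example idea: remove the region the VQA
# system attends to most, inpaint it, and check whether the answer changes.
import numpy as np

def counterfactual_answer_change(image, question, attention, inpaint_region, vqa_answer):
    """attention(image, question) -> HxW array of attention values."""
    attn = attention(image, question)
    ys, xs = np.unravel_index(np.argmax(attn), attn.shape)
    # Mask a square patch around the attention peak and inpaint it.
    mask = np.zeros_like(attn, dtype=bool)
    mask[max(0, ys - 16):ys + 16, max(0, xs - 16):xs + 16] = True
    edited = inpaint_region(image, mask)
    # Return the original and counterfactual answers for comparison.
    return vqa_answer(image, question), vqa_answer(edited, question)
```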
October 2021 · 49 Reads
Despite their success and popularity, deep neural networks (DNNs) are vulnerable to backdoor attacks. This impedes their wider adoption, especially in mission-critical applications. This paper tackles the problem of Trojan detection, namely, identifying Trojaned models -- models trained with poisoned data. One popular approach is reverse engineering, i.e., recovering the triggers on a clean image by manipulating the model's prediction. One major challenge of the reverse-engineering approach is the enormous search space of triggers. To this end, we propose innovative priors, such as diversity and topological simplicity, to not only increase the chances of finding the appropriate triggers but also improve the quality of the found triggers. Moreover, by encouraging a diverse set of trigger candidates, our method can perform effectively in cases with unknown target labels. We demonstrate that these priors can significantly improve the quality of the recovered triggers, resulting in substantially improved Trojan detection accuracy as validated on both synthetic and publicly available TrojAI benchmarks.
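A hedged sketch of trigger reverse engineering with a diversity prior: several candidate triggers are optimized jointly to flip predictions toward a target label, with penalties for large masks and for masks that overlap one another. The loss weights are arbitrary, the paper's topological-simplicity prior is omitted, and `model` is any image classifier.

```python
# Hedged sketch only: jointly optimize several candidate triggers, rewarding
# them for flipping predictions to `target_label` and penalizing mask size
# (simplicity) and pairwise mask overlap (diversity).
import torch
import torch.nn.functional as F

def recover_triggers(model, clean_images, target_label, n_triggers=4, steps=200):
    """clean_images: (B, C, H, W) tensor; returns candidate masks and patterns."""
    B, C, H, W = clean_images.shape
    triggers = torch.zeros(n_triggers, C, H, W, requires_grad=True)
    masks = torch.full((n_triggers, 1, H, W), -2.0, requires_grad=True)
    opt = torch.optim.Adam([triggers, masks], lr=0.05)
    for _ in range(steps):
        m = torch.sigmoid(masks)                       # soft masks in [0, 1]
        # Stamp every trigger onto every clean image: (n_triggers, B, C, H, W).
        stamped = (1 - m[:, None]) * clean_images[None] + m[:, None] * triggers[:, None]
        logits = model(stamped.flatten(0, 1))
        target = torch.full((n_triggers * B,), target_label, dtype=torch.long)
        attack_loss = F.cross_entropy(logits, target)
        sparsity = m.abs().mean()                      # prefer small triggers
        flat = m.flatten(1)
        diversity = (flat @ flat.t()).triu(1).mean()   # discourage overlapping masks
        loss = attack_loss + 0.1 * sparsity + 0.1 * diversity
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.sigmoid(masks).detach(), triggers.detach()
```

Running such an optimization for each candidate target label and inspecting the recovered masks is one way a diversity prior can shrink the trigger search space when the true target label is unknown.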
... The rapid progress in Large Language Models (LLMs) has significantly boosted AI capabilities, particularly in text generation and complex reasoning [7,14,17,50,51,53,67,68,75]. Expanding these models into multimodal applications, Multimodal LLMs (MLLMs) have shown strong performance across image and short video tasks like captioning, question answering, and segmentation [10,11,16,34,35,37,57,58,62,72,93]. However, as video content lengthens, these models encounter significant challenges due to context length limitations, which restrict their ability to process multiple frames and capture complex, extended temporal interactions. ...
June 2024
... To improve the ability of LVLMs to resist malicious queries, recent research has explored various safety alignment strategies [10], [11], [19]. A straightforward approach is to train LVLMs to refuse harmful queries using methods such as Reinforcement Learning from Human Feedback (RLHF) [12], [13], [32] or Supervised Fine-Tuning (SFT) [8], [33], [34], with the goal of reinforcing safe behaviors across diverse inputs. While such approaches can be effective, they require substantial human annotation and computational resources, which limits their scalability and adaptability. ...
June 2024
... Reasoning in LVLMs: Inspired by the success of CoT prompting and training in LLMs, several works (Cheng et al., 2024;Chen et al., 2024b;Shen et al., 2025) have made progress in boosting LVLM performance by incorporating curated CoT data during training. Alibaba released QVQ (Qwen Team, 2024a), a reasoning LVLM along the lines of QwQ (Qwen Team, 2024a) and trained via an RL-based approach. ...
January 2024
... Diversity [20,21] and controllability [12,14] are two key attributes that have received widespread attention in previous IC research. Recent findings indicate that titles generated by supervised methods tend to be more generalized, capturing the most common language patterns and words from the training corpus, a phenomenon referred to as the pattern collapse issue. ...
April 2018
Proceedings of the AAAI Conference on Artificial Intelligence
... Class activation mapping methods, such as Grad-CAM for CNNs, can show the importance of each region of the image to the model's decision-making process [28]. Others show error maps to highlight areas that might be erroneous [29] or use multiple maps for different VQA systems to generate a final heat map [30]. Furthermore, approaches have been designed to explicitly improve interpretability. ...
November 2021
... This conjecture builds on two streams of research. One is research on AI-assisted decision making and problem solving, which recently started investigating how AI support can improve human learning, for example, in visual (e.g., Alipour et al., 2021; Goyal et al., 2019) or planning tasks. The other is decades-old research on multiple-cue learning (Harvey, 2012), which studies how and how well people make judgments and predictions about a target variable (e.g., a patient's future health outcome) based on multiple fallible cues (e.g., medical tests). ...
November 2021
... For instance, Amazon [1] and Alibaba [2] use taxonomies in their e-commerce businesses to enhance the online shopping experience, while Pinterest utilizes taxonomies for content recommendation and advertisement targeting [3,4]. Moreover, taxonomies are used in MeSH [5], Wikidata [6], Bloom's Taxonomy [7], WordNet [8], and DBPedia [9], where they enhance information retrieval systems, enabling more accurate and efficient access to relevant data and knowledge across a range of fields. ...
January 2021
... One of the issues is that users often have difficulty understanding the explanations. In particular, the generated heatmaps can be too coarse or erroneous due to low illumination or the complexity of the entity [14,23]. Another issue is that the way AI explanations are presented can lead to cognitive bias in users, which results in miscalibration of trust and degrades task performance [4,21,27]. ...
March 2021
... To enhance the interpretability of deep learning models, various methods have been developed. One such method is Grad-CAM [152], which uses the gradients of the final output with respect to a specific layer's feature maps to produce a localization map for a given input image. These techniques help identify the important regions of the input that most significantly contribute to the model's final decision. ...
February 2020
International Journal of Computer Vision
... To better understand the decision mechanism of the model, this study employs the Gradient-weighted Class Activation Mapping (Grad-CAM) proposed by Selvaraju et al. [96]. Grad-CAM facilitates interpretability by generating heatmaps that highlight critical regions in the input image which most influence the model's predictions. ...
October 2017
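Since several of the citing works above describe Grad-CAM, a short sketch of the technique may help: pool the gradients of a class score over a convolutional feature map, use them to weight the activations, and apply ReLU to the weighted sum. torchvision's resnet18 is used purely as an example backbone.

```python
# Hedged sketch of Grad-CAM: weight a convolutional layer's activations by the
# pooled gradients of a class score, ReLU the sum, and upsample to image size.
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

model = resnet18(weights=None).eval()
activations, gradients = {}, {}

def fwd_hook(_, __, output):
    activations["value"] = output            # feature map of the chosen layer

def bwd_hook(_, grad_input, grad_output):
    gradients["value"] = grad_output[0]      # gradient w.r.t. that feature map

layer = model.layer4[-1]                     # last residual block as an example
layer.register_forward_hook(fwd_hook)
layer.register_full_backward_hook(bwd_hook)

def grad_cam(image, class_idx):
    """image: (1, 3, H, W) tensor; returns an HxW heatmap in [0, 1]."""
    logits = model(image)
    model.zero_grad()
    logits[0, class_idx].backward()
    acts, grads = activations["value"], gradients["value"]   # (1, C, h, w)
    weights = grads.mean(dim=(2, 3), keepdim=True)            # pooled gradients
    cam = F.relu((weights * acts).sum(dim=1, keepdim=True))   # weighted sum
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
    cam = cam - cam.min()
    return (cam / cam.max().clamp(min=1e-8)).squeeze()

heatmap = grad_cam(torch.randn(1, 3, 224, 224), class_idx=0)
```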