Mubarak Shah’s research while affiliated with University of Central Florida and other places

What is this page?


This page lists works of an author who doesn't have a ResearchGate profile or hasn't added the works to their profile yet. It is automatically generated from public (personal) data to further our legitimate goal of comprehensive and accurate scientific recordkeeping. If you are this author and want this page removed, please let us know.

Publications (675)


LLM Post-Training: A Deep Dive into Reasoning Large Language Models
  • Preprint
  • File available

February 2025 · 93 Reads

Komal Kumar · Tajamul Ashraf · Omkar Thawakar · [...]

Large Language Models (LLMs) have transformed the natural language processing landscape and brought to life diverse applications. Pretraining on vast web-scale data has laid the foundation for these models, yet the research community is now increasingly shifting focus toward post-training techniques to achieve further breakthroughs. While pretraining provides a broad linguistic foundation, post-training methods enable LLMs to refine their knowledge, improve reasoning, enhance factual accuracy, and align more effectively with user intents and ethical considerations. Fine-tuning, reinforcement learning, and test-time scaling have emerged as critical strategies for optimizing LLM performance, ensuring robustness, and improving adaptability across various real-world tasks. This survey provides a systematic exploration of post-training methodologies, analyzing their role in refining LLMs beyond pretraining and addressing key challenges such as catastrophic forgetting, reward hacking, and inference-time trade-offs. We highlight emerging directions in model alignment, scalable adaptation, and inference-time reasoning, and outline future research directions. We also provide a public repository to continually track developments in this fast-evolving field: https://github.com/mbzuai-oryx/Awesome-LLM-Post-training.
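
Of the post-training strategies mentioned in the abstract, test-time scaling is the simplest to illustrate in isolation. Below is a minimal best-of-N sampling sketch, assuming a placeholder GPT-2 checkpoint and a toy reward function standing in for a learned reward model; it is not code from the survey or its repository.

```python
# Minimal best-of-N test-time scaling sketch (illustrative only; the GPT-2
# checkpoint and toy reward below are placeholders, not from the survey).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def toy_reward(prompt: str, completion: str) -> float:
    # Stand-in for a learned reward model: favor longer, less repetitive text.
    words = completion.split()
    return len(set(words)) / max(len(words), 1)

def best_of_n(prompt: str, n: int = 8, max_new_tokens: int = 64) -> str:
    tok = AutoTokenizer.from_pretrained("gpt2")
    lm = AutoModelForCausalLM.from_pretrained("gpt2")
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        outs = lm.generate(**inputs, do_sample=True, top_p=0.95,
                           num_return_sequences=n, max_new_tokens=max_new_tokens,
                           pad_token_id=tok.eos_token_id)
    prompt_len = inputs["input_ids"].shape[1]
    candidates = [tok.decode(o[prompt_len:], skip_special_tokens=True) for o in outs]
    # Score every candidate with the (placeholder) reward and return the best one.
    scores = [toy_reward(prompt, c) for c in candidates]
    return candidates[scores.index(max(scores))]
```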


Figure 2. An illustration of the conventional n-frame video M-LLM framework and our video M-LLM framework with frame selection.
Figure 3. An illustration of the spatial and temporal pseudo-labeling for the importance scores.
Figures 5-7. Visualization examples of the frame selection results.

M-LLM Based Video Frame Selection for Efficient Video Understanding

February 2025 · 28 Reads

Recent advances in Multi-Modal Large Language Models (M-LLMs) show promising results in video reasoning. Popular M-LLM frameworks usually apply naive uniform sampling to reduce the number of video frames fed into the model, particularly for long-context videos. However, uniform sampling can discard crucial context from certain periods of a video, leaving the downstream M-LLM without sufficient visual information to answer a question. To address this, we propose a lightweight M-LLM-based frame selection method that adaptively selects frames more relevant to the user's query. To train the proposed frame selector, we introduce two supervision signals: (i) a spatial signal, in which single-frame importance scores are obtained by prompting an M-LLM; and (ii) a temporal signal, in which multi-frame selections are obtained by prompting a Large Language Model (LLM) with the captions of all candidate frames. The selected frames are then digested by a frozen downstream video M-LLM for visual reasoning and question answering. Empirical results show that the proposed M-LLM video frame selector improves the performance of various downstream video Large Language Models (video-LLMs) across medium-context (ActivityNet, NExT-QA) and long-context (EgoSchema, LongVideoBench) video question answering benchmarks.
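
As a rough illustration of query-aware frame selection (not the paper's trained M-LLM selector), the sketch below scores candidate frames against the user query with off-the-shelf CLIP similarity and keeps the top-k frames in temporal order; the checkpoint name and scoring choice are assumptions.

```python
# Query-aware top-k frame selection sketch. The paper trains an M-LLM-based
# selector with spatial/temporal pseudo labels; here, CLIP image-text
# similarity is used purely as an illustrative stand-in scorer.
import torch
from transformers import CLIPModel, CLIPProcessor

def select_frames(frames, query: str, k: int = 8):
    """frames: list of PIL.Image candidate frames sampled from the video."""
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    inputs = proc(text=[query], images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        scores = model(**inputs).logits_per_image.squeeze(-1)  # (num_frames,)
    # Keep the k frames most similar to the query, restored to temporal order.
    keep = torch.topk(scores, k=min(k, len(frames))).indices.sort().values
    return [frames[i] for i in keep.tolist()]  # to be fed to the frozen video-LLM
```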


VLDBench: Vision Language Models Disinformation Detection Benchmark

February 2025 · 20 Reads

The rapid rise of AI-generated content has made detecting disinformation increasingly challenging. In particular, multimodal disinformation, i.e., online posts and articles that combine images with fabricated text, is specifically designed to deceive. While existing AI safety benchmarks primarily address bias and toxicity, multimodal disinformation detection remains largely underexplored. To address this challenge, we present the Vision-Language Disinformation Detection Benchmark (VLDBench), the first comprehensive benchmark for detecting disinformation across both unimodal (text-only) and multimodal (text and image) content, comprising 31,000 news article-image pairs spanning 13 distinct categories for robust evaluation. VLDBench features a rigorous semi-automated data curation pipeline, with 22 domain experts dedicating more than 300 hours to annotation and achieving strong inter-annotator agreement (Cohen's kappa = 0.78). We extensively evaluate state-of-the-art Large Language Models (LLMs) and Vision-Language Models (VLMs), demonstrating that integrating textual and visual cues in multimodal news posts improves disinformation detection accuracy by 5-35% compared to unimodal models. Developed in alignment with AI governance frameworks such as the EU AI Act, NIST guidelines, and the MIT AI Risk Repository 2024, VLDBench is expected to become a benchmark for detecting disinformation in online multimodal content. Our code and data will be publicly available.
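
The reported inter-annotator agreement is a standard Cohen's kappa; a minimal sketch of how such a figure is computed (with made-up labels, not VLDBench data) is shown below.

```python
# Illustrative Cohen's kappa computation for inter-annotator agreement;
# the labels below are made up, not VLDBench annotations.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["disinfo", "credible", "disinfo", "credible", "disinfo", "credible"]
annotator_b = ["disinfo", "credible", "credible", "credible", "disinfo", "credible"]
kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 = perfect agreement, 0.0 = chance level
```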


SB-Bench: Stereotype Bias Benchmark for Large Multimodal Models

February 2025 · 3 Reads

Stereotype biases in Large Multimodal Models (LMMs) perpetuate harmful societal prejudices, undermining the fairness and equity of AI applications. As LMMs grow increasingly influential, addressing and mitigating inherent biases related to stereotypes, harmful generations, and ambiguous assumptions in real-world scenarios has become essential. However, existing datasets evaluating stereotype biases in LMMs often lack diversity and rely on synthetic images, leaving a gap in bias evaluation for real-world visual contexts. To address this, we introduce the Stereotype Bias Benchmark (SB-bench), the most comprehensive framework to date for assessing stereotype biases across nine diverse categories with non-synthetic images. SB-bench rigorously evaluates LMMs through carefully curated, visually grounded scenarios, challenging them to reason accurately about visual stereotypes. It offers a robust evaluation framework featuring real-world visual samples, image variations, and multiple-choice question formats. By introducing visually grounded queries that isolate visual biases from textual ones, SB-bench enables a precise and nuanced assessment of a model's reasoning capabilities across varying levels of difficulty. Through rigorous testing of state-of-the-art open-source and closed-source LMMs, SB-bench provides a systematic approach to assessing stereotype biases in LMMs across key social dimensions. This benchmark represents a significant step toward fostering fairness in AI systems and reducing harmful biases, laying the groundwork for more equitable and socially responsible LMMs. Our code and dataset are publicly available.
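
For context, a generic multiple-choice evaluation loop of the kind such a benchmark implies might look like the sketch below; the sample schema, prompt template, and answer parsing are assumptions, not SB-bench's actual harness.

```python
# Generic multiple-choice evaluation loop; the sample schema and answer
# parsing below are illustrative assumptions, not the SB-bench harness.
def mcq_accuracy(samples, answer_fn):
    """samples: list of dicts with 'image', 'question', 'options', 'answer' (e.g. 'B').
    answer_fn: callable (image, prompt) -> model response string."""
    correct = 0
    for s in samples:
        options = "\n".join(f"{chr(65 + i)}. {o}" for i, o in enumerate(s["options"]))
        prompt = f"{s['question']}\n{options}\nAnswer with the option letter only."
        pred = answer_fn(s["image"], prompt).strip().upper()[:1]
        correct += int(pred == s["answer"])
    return correct / max(len(samples), 1)
```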



ALBAR: Adversarial Learning approach to mitigate Biases in Action Recognition

January 2025

Bias in machine learning models can lead to unfair decision making, and while it has been well-studied in the image and text domains, it remains underexplored in action recognition. Action recognition models often suffer from background bias (i.e., inferring actions based on background cues) and foreground bias (i.e., relying on subject appearance), which can be detrimental to real-life applications such as autonomous vehicles or assisted living monitoring. While prior approaches have mainly focused on mitigating background bias using specialized augmentations, we thoroughly study both biases. We propose ALBAR, a novel adversarial training method that mitigates foreground and background biases without requiring specialized knowledge of the bias attributes. Our framework applies an adversarial cross-entropy loss to the sampled static clip (where all the frames are the same) and aims to make its class probabilities uniform using a proposed entropy maximization loss. Additionally, we introduce a gradient penalty loss for regularization against the debiasing process. We evaluate our method on established background and foreground bias protocols, setting a new state-of-the-art and strongly improving combined debiasing performance by over 12% on HMDB51. Furthermore, we identify an issue of background leakage in the existing UCF101 protocol for bias evaluation which provides a shortcut to predict actions and does not provide an accurate measure of the debiasing capability of a model. We address this issue by proposing more fine-grained segmentation boundaries for the actor, where our method also outperforms existing approaches. Project Page: https://joefioresi718.github.io/ALBAR_webpage/
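
A simplified reading of the static-clip entropy-maximization idea described above (not the authors' implementation; the gradient penalty term is omitted) might look like this:

```python
# Simplified sketch of the debiasing objective described above: on a static
# clip (one frame repeated), push the classifier's predictive distribution
# toward uniform via entropy maximization, so background/appearance alone
# cannot determine the action. Not the authors' code.
import torch
import torch.nn.functional as F

def entropy_max_loss(static_logits: torch.Tensor) -> torch.Tensor:
    log_p = F.log_softmax(static_logits, dim=-1)
    entropy = -(log_p.exp() * log_p).sum(dim=-1)   # per-sample entropy
    return -entropy.mean()                         # minimizing this maximizes entropy

def debias_step(model, clip, static_clip, labels, lam: float = 1.0):
    ce = F.cross_entropy(model(clip), labels)      # standard action recognition loss
    ent = entropy_max_loss(model(static_clip))     # adversarial static-clip term
    return ce + lam * ent                          # gradient penalty omitted here
```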


Figure 3. Baseline comparison for multiple-choice TLQA: blank frames are provided to SeViLA as a baseline on the multiple-choice TLQA benchmark. MC: Multiple-Choice.
Figure 4. Qualitative results for multiple-choice QA: in some cases, scene or object information may correlate with the correct answer, resulting in an easier setup compared to binary QA.
Table: Overview of temporal operator syntax. M indicates the temporal model, t indicates time.
Table: Zero-shot results on Boolean QA (TLQA-S) over four datasets (VL: VideoLLaVA [13], VCG: VideoChatGPT [15], CV: ChatUniVi [4]). Binary QA performance is generally closer to random-baseline performance than multiple-choice QA, indicating a harder task.
TimeLogic: A Temporal Logic Benchmark for Video QA

January 2025 · 6 Reads

Temporal logical understanding, a core facet of human cognition, plays a pivotal role in capturing complex sequential events and their temporal relationships within videos. This capability is particularly crucial in tasks like Video Question Answering (VideoQA), where the goal is to process visual data over time together with textual data to provide coherent answers. However, current VideoQA benchmarks devote little focus to evaluating this critical skill due to the challenge of annotating temporal logic. Despite the advancement of vision-language models, assessing their temporal logical reasoning powers remains a challenge, primarily due to the lack of QA pairs that demand formal, complex temporal reasoning. To bridge this gap, we introduce the TimeLogic QA (TLQA) framework to automatically generate QA pairs specifically designed to evaluate temporal logical understanding. To this end, TLQA leverages temporal annotations from existing video datasets together with temporal operators derived from logic theory to construct questions that test understanding of event sequences and their temporal relationships. The TLQA framework is generic and scalable, capable of leveraging either existing video action datasets with temporal action segmentation annotations or video datasets with temporal scene graph annotations to automatically generate temporal logical questions. We leverage 4 datasets, STAR, Breakfast, AGQA, and CrossTask, and generate two VideoQA dataset variants - small (TLQA-S) and large (TLQA-L) - containing 2k and 10k QA pairs for each category, resulting in 32k and 160k total pairs per dataset. We undertake a comprehensive evaluation of leading-edge VideoQA models, employing TLQA to benchmark their temporal logical understanding capabilities. We assess the VideoQA models' temporal reasoning performance on 16 categories of temporal logic with varying temporal complexity.
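
To make the idea of operator-based question generation concrete, the toy sketch below turns temporal action-segmentation annotations into "before" questions; the segment format and question template are assumptions, not the TLQA implementation.

```python
# Toy generation of temporal-logic QA pairs with a "before" operator from
# temporal segments; the format and template are assumptions, not TLQA code.
def before_questions(segments):
    """segments: list of (action_label, start_sec, end_sec) tuples for one video."""
    ordered = sorted(segments, key=lambda s: s[1])
    qa_pairs = []
    for (a, _, a_end), (b, b_start, _) in zip(ordered, ordered[1:]):
        if a_end <= b_start:  # "a before b" holds in the temporal-logic sense
            qa_pairs.append({
                "question": f"Does '{a}' happen before '{b}' in the video?",
                "answer": "yes",
            })
    return qa_pairs

print(before_questions([("crack egg", 2, 6), ("whisk egg", 8, 15), ("pour batter", 20, 30)]))
```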


Figure 4: The comprehensive comparison of category-wise and overall performance scores achieved by various models on diverse reasoning tasks. The evaluation spans multiple domains, including Math & Logic Reasoning, Scientific Reasoning, Complex Visual Perception, Chart & Diagram Understanding, Medical Imaging, Social & Cultural Context, Visual Reasoning, and OCR & Document Understanding. The models assessed include GPT-4o, Claude-3.5-Sonnet, Gemini variants, LLAVA-CoT, and our proposed model. Our model demonstrates consistently superior performance in critical categories such as Math & Logic Reasoning, Chart & Diagram Understanding, and Medical Imaging, achieving a balanced improvement across both step-by-step reasoning (Step Scores) and final answer accuracy (Final Answer Scores). Compared to LLAVA-CoT, our approach excels in maintaining high accuracy across tasks while showcasing robustness and interpretability in multi-step reasoning challenges.
LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs

January 2025 · 31 Reads

Reasoning is a fundamental capability for solving complex multi-step problems, particularly in visual contexts where sequential step-wise understanding is essential. Existing approaches lack a comprehensive framework for evaluating visual reasoning and do not emphasize step-wise problem-solving. To this end, we propose a comprehensive framework for advancing step-by-step visual reasoning in large multimodal models (LMMs) through three key contributions. First, we introduce a visual reasoning benchmark specifically designed to evaluate multi-step reasoning tasks. The benchmark presents a diverse set of challenges across eight categories, ranging from complex visual perception to scientific reasoning, with over 4k reasoning steps in total, enabling robust evaluation of LMMs' abilities to perform accurate and interpretable visual reasoning across multiple steps. Second, we propose a novel metric that assesses visual reasoning quality at the granularity of individual steps, emphasizing both correctness and logical coherence. The proposed metric offers deeper insights into reasoning performance compared to traditional end-task accuracy metrics. Third, we present a new multimodal visual reasoning model, named LlamaV-o1, trained using a multi-step curriculum learning approach, where tasks are progressively organized to facilitate incremental skill acquisition and problem-solving. The proposed LlamaV-o1 is designed for multi-step reasoning and learns step-by-step through a structured training paradigm. Extensive experiments show that our LlamaV-o1 outperforms existing open-source models and performs favorably against closed-source proprietary models. Compared to the recent Llava-CoT, our LlamaV-o1 achieves an average score of 67.3 with an absolute gain of 3.8% across six benchmarks while being 5 times faster during inference scaling. Our benchmark, model, and code are publicly available.
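
A step-granular metric in the spirit described above could be sketched as follows, with a trivial string-matching placeholder standing in for a proper judge; this is an assumption-laden illustration, not the paper's metric.

```python
# Assumption-laden sketch of a step-granular reasoning score plus final-answer
# accuracy; exact string matching stands in for a proper (e.g., LLM) judge.
def step_score(pred_steps, ref_steps,
               match=lambda p, r: p.strip().lower() == r.strip().lower()):
    if not ref_steps:
        return 0.0
    hits = sum(any(match(p, r) for p in pred_steps) for r in ref_steps)
    return hits / len(ref_steps)

def evaluate(sample):
    """sample: dict with 'pred_steps', 'ref_steps', 'pred_answer', 'ref_answer'."""
    return {
        "step_score": step_score(sample["pred_steps"], sample["ref_steps"]),
        "final_answer_score": float(sample["pred_answer"] == sample["ref_answer"]),
    }
```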


Foundation Models Defining a New Era in Vision: A Survey and Outlook

January 2025 · 46 Reads · 57 Citations

IEEE Transactions on Pattern Analysis and Machine Intelligence

Vision systems that see and reason about the compositional nature of visual scenes are fundamental to understanding our world. The complex relations between objects and their locations, ambiguities, and variations in the real-world environment can be better described in human language, naturally governed by grammatical rules, and in other modalities such as audio and depth. Models learned to bridge the gap between such modalities and large-scale training data facilitate contextual reasoning, generalization, and prompting capabilities at test time. These models are referred to as foundation models. The output of such models can be modified through human-provided prompts without retraining, e.g., segmenting a particular object by providing a bounding box, having interactive dialogues by asking questions about an image or video scene, or manipulating a robot's behavior through language instructions. In this survey, we provide a comprehensive review of such emerging foundation models, including typical architecture designs for combining different modalities (vision, text, audio, etc.), training objectives (contrastive, generative), pre-training datasets, fine-tuning mechanisms, and the common prompting patterns: textual, visual, and heterogeneous. We discuss the open challenges and research directions for foundation models in computer vision, including difficulties in their evaluation and benchmarking, gaps in their real-world understanding, limitations of contextual understanding, biases, vulnerability to adversarial attacks, and interpretability issues. We review recent developments in this field, covering a wide range of applications of foundation models systematically and comprehensively. A comprehensive list of the foundation models studied in this work is available at https://github.com/awaisrauf/Awesome-CV-Foundational-Models.
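
As an example of the contrastive training objective mentioned above, a minimal CLIP-style symmetric loss (a generic formulation, not tied to any specific model in the survey) is:

```python
# Minimal CLIP-style contrastive objective: matched image-text pairs are
# pulled together and mismatched pairs pushed apart within a batch.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature: float = 0.07):
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / temperature   # (B, B) similarity matrix
    targets = torch.arange(len(logits), device=logits.device)  # diagonal = positives
    # Symmetric cross-entropy over image->text and text->image directions.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
```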


LIAR: Leveraging Alignment (Best-of-N) to Jailbreak LLMs in Seconds

December 2024 · 10 Reads

Many existing jailbreak techniques rely on solving discrete combinatorial optimization, while more recent approaches involve training LLMs to generate multiple adversarial prompts. However, both approaches require significant computational resources to produce even a single adversarial prompt. We hypothesize that the inefficiency of current approaches stems from an inadequate characterization of the jailbreak problem. To address this gap, we formulate the jailbreak problem in terms of alignment. By starting from an available safety-aligned model, we leverage an unsafe reward to guide the safe model toward generating unsafe outputs using alignment techniques (e.g., reinforcement learning from human feedback), effectively performing jailbreaking via alignment. We propose a novel jailbreak method called LIAR (LeveragIng Alignment to jailbReak). To demonstrate the simplicity and effectiveness of our approach, we employ a best-of-N method to solve the alignment problem. LIAR offers significant advantages: lower computational requirements without additional training, fully black-box operation, competitive attack success rates, and more human-readable prompts. We provide theoretical insights into the possibility of jailbreaking a safety-aligned model, revealing inherent vulnerabilities in current alignment strategies for LLMs. We also provide sub-optimality guarantees for the proposed LIAR. Experimentally, we achieve an attack success rate (ASR) comparable to the SoTA with a 10x improvement in perplexity and a time-to-attack measured in seconds rather than tens of hours.


Citations (44)


... Foundation Vision Models. Recently, foundation vision backbones [6,29] have been successfully applied to various challenging downstream tasks [3,14,17,22], yielding remarkable results. In contrast to traditional supervised training, these backbones typically employ self-supervision or weak supervision schemes, such as masked image modeling [44], contrastive learning [8] or self distillation, reducing the reliance on labels and enhancing the quality of representation. ...

Reference:

Interpretable Image Classification via Non-parametric Part Prototype Learning
Foundation Models Defining a New Era in Vision: A Survey and Outlook
  • Citing Article
  • January 2025

IEEE Transactions on Pattern Analysis and Machine Intelligence

... In real-world scenarios, even when the target environment differs significantly from training and includes human interference, the robot can still effectively achieve its goals through multimodal behavior. In the natural domain action segmentation task, Liu et al. proposed DiffAct [28] and DiffAct++ [29], which transform the action segmentation process into a gradual denoising process for refinement. This process introduces stochastic feature information, improving performance across multiple datasets. ...

DiffAct++: Diffusion Action Segmentation
  • Citing Article
  • November 2024

IEEE Transactions on Pattern Analysis and Machine Intelligence

... They proposed an end-to-end temporal feature aggregation method for Sequence-CVGL. Pillai et al. [28] proposed a fully transformer-based CVGL method to efficiently aggregate image-level representations and adapt it for video inputs. Although these works have advanced CVGL, ground images in these datasets exhibit high content redundancy, and the methods are heavily reliant on sequential relationships. ...

GAReT: Cross-View Video Geolocalization with Adapters and Auto-Regressive Transformers
  • Citing Chapter
  • November 2024

... Contrastive learning techniques have shown strong performance in mitigating the variability across high- to low-cost devices; Dave et al. [82] demonstrate this by leveraging a domain-adaptive contrastive loss as part of the training procedure. ...

Codamal: Contrastive Domain Adaptation for Malaria Detection in Low-Cost Microscopes
  • Citing Conference Paper
  • October 2024

... 3D Vision-Language Learning Recent advancements in 3D vision-language (3D-VL) learning [Chen et al., 2020;Achlioptas et al., 2020;Azuma et al., 2022;Kang et al., 2024b] have focused on bridging the gap between 3D scene understanding and natural language, which is essential for developing embodied intelligence. Similar to 2D vision-language learning [Kazemzadeh et al., 2014;Kang et al., 2024c;Johnson et al., 2016;You et al., 2023;Kang et al., 2024a;Antol et al., 2015;Kang et al., 2024d], a variety of tasks such as 3D Visual Grounding [Chen et al., 2020;Achlioptas et al., 2020], 3D Dense Captioning, and 3D Question Answering [Azuma et al., 2022;Ma et al., 2022] have been proposed to evaluate the ability of human instruction following in relation to 3D object properties and spatial relationships. Initial efforts focus on building a task-specific model for a single task, such as EDA for grounding and Vote2Cap-DETR++ [Chen et al., 2024b] for captioning. ...

SegVG: Transferring Object Bounding Box to Segmentation for Visual Grounding
  • Citing Chapter
  • October 2024

... It maintains a surrogate model of the performance based on past evaluations of configurations, which guides the choice of promising configurations to evaluate. Recent studies on BO have explored expert priors [11,20,26,29], derivative information [1,27,35], and enhancing the interpretability [5,36-39] of HPO and NAS [3,24,25]. ...

Regulating Model Reliance on Non-robust Features by Smoothing Input Marginal Density
  • Citing Chapter
  • September 2024

... In the video domain, recent studies have enhanced cross-modal interactions by utilizing additional data, such as captions derived from videos. CoVR [31] improved the composed video retrieval performance by using cross-attention between videos and generated captions, and Cap4Video [39] proposed a co-attention framework using single-sentence captions summarizing entire videos. Nonetheless, these studies primarily focus on global video content, often overlooking local details. ...

Composed Video Retrieval via Enriched Context and Discriminative Embeddings
  • Citing Conference Paper
  • June 2024

... Dual-focus visual captioning. Existing video automatic captioning strategies [4,6,47] only use image captioners like BLIP [27] and GPT-4V [40] to describe uniformly sampled frames, lacking the awareness of temporal dynamic knowledge in videos. In contrast, we focus on both spatial visual details and significant dynamic information (i.e., actions and camera movements), and also consider the complexities of long video events. ...

VidLA: Video-Language Alignment at Scale
  • Citing Conference Paper
  • June 2024

... In contrast, individuals with milder symptoms emphasize the technical trustworthiness of GenAI chatbots, focusing on the technical limitations of LLM models, such as hallucination, accuracy issues, memory constraints, and decision-making processes. While existing research extensively covers the technical challenges and trustworthiness concerns of GenAI and LLMs, including issues like hallucination and model limitations [4,40,76,119], our study contextualizes these challenges specifically within the context of SA. We seek to explore how these individuals perceive and react to these challenges when considering GenAI chatbots as a tool for managing their condition, and the ways in which these technical concerns shape their trust and willingness to use GenAI chatbots. ...

Large Language Models: A Comprehensive Survey of its Applications, Challenges, Limitations, and Future Prospects