The training strategy of the proposed IMI-VL model. Stage 1: Training dataset construction generates textual responses and image descriptions using a language model and a vision-language model. These are combined into interleaved responses using image contexts and captions. Stage 2: Supervised fine-tuning trains the model, which consists of a vision encoder, an adapter, and a language model, by optimizing a generative loss.
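
The caption does not define the generative loss in Stage 2. A common reading is next-token cross-entropy averaged over the response tokens only, and the sketch below illustrates that reading under this assumption. The function name `generative_loss` and the `loss_mask` argument are hypothetical and are not taken from the source publication.

```python
import math
from typing import Sequence


def generative_loss(
    logits: Sequence[Sequence[float]],
    targets: Sequence[int],
    loss_mask: Sequence[int],
) -> float:
    """Mean next-token cross-entropy over positions where loss_mask is 1.

    logits[i] holds the unnormalized vocabulary scores at position i,
    targets[i] is the index of the correct next token, and loss_mask[i]
    is 1 for response tokens and 0 for prompt or image-context tokens
    (assumed masking scheme, not confirmed by the source).
    """
    total = 0.0
    count = 0
    for row, target, keep in zip(logits, targets, loss_mask):
        if not keep:
            continue
        # Log-sum-exp with max subtraction for numerical stability.
        peak = max(row)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        # Negative log-probability of the target token.
        total += log_z - row[target]
        count += 1
    if count == 0:
        raise ValueError("loss_mask selects no positions")
    return total / count
```

With uniform scores over a four-token vocabulary, each unmasked position contributes log 4, which is a quick sanity check for the implementation.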

Source publication
Preprint
Vision-Language Models (VLMs) have demonstrated remarkable capabilities in understanding multimodal inputs and have been widely integrated into Retrieval-Augmented Generation (RAG) based conversational systems. While current VLM-powered chatbots can provide textual source references in their responses, they exhibit significant limitations in refere...

Context in source publication

Context 1
... preserve the general VLM capabilities, we combine the original SFT data with interleaved multi-image SFT data. The ImageRef-VL framework is illustrated in Figure 2. information related to the image described in the text. ...
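
The excerpt says the original SFT data is combined with interleaved multi-image SFT data but gives no mixing ratio or sampling scheme. As a minimal sketch, assuming plain concatenation followed by a seeded shuffle, the combination could look like this. The function `mix_sft_data` and the `source` tag are hypothetical names introduced for illustration.

```python
import random
from typing import Any, Dict, List


def mix_sft_data(
    original: List[Dict[str, Any]],
    interleaved: List[Dict[str, Any]],
    seed: int = 0,
) -> List[Dict[str, Any]]:
    """Concatenate two SFT datasets and shuffle them reproducibly.

    Each example is copied and tagged with its origin so the mixture
    can be inspected later. No re-weighting is applied; the actual
    ratio used in the source publication is not stated in the excerpt.
    """
    mixed = [{**example, "source": "original"} for example in original]
    mixed += [{**example, "source": "interleaved"} for example in interleaved]
    # A dedicated Random instance keeps the shuffle deterministic
    # without touching the global random state.
    random.Random(seed).shuffle(mixed)
    return mixed
```

Tagging each example with its origin makes it easy to check afterwards that the general-purpose data is still represented in the training mixture.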
