Junhao Yin’s scientific contributions

What is this page?


This page lists works of an author who doesn't have a ResearchGate profile or hasn't added the works to their profile yet. It is automatically generated from public (personal) data to further our legitimate goal of comprehensive and accurate scientific recordkeeping.

Publications (1)


Figure 1: An example of contextual image reference, where referencing the images of a Brachiosaurus can largely enhance user comprehension and engagement.
Figure 2: The training strategy of the proposed ImageRef-VL model. Stage 1: Training dataset construction involves generating textual responses and image descriptions through a language model and a vision-language model. These are combined into interleaved responses using image contexts and captions. Stage 2: Supervised fine-tuning refines the model with a vision encoder, adapter, and language model, optimizing through a generative loss.
Figure 3: Human evaluation score distribution of four methods.
Table: Statistical details of the datasets used in our experiments.
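
Figure 2's Stage 1 combines model-generated textual answers with image captions into interleaved responses. Purely as an illustration of that interleaving step, the Python sketch below inserts image-reference placeholders into a text answer via naive caption/sentence word overlap; every name in it (RetrievedImage, build_interleaved_response, the <image:...> placeholder format) is hypothetical and is not taken from the paper or its repository.

# Minimal sketch of the Stage-1 idea in Figure 2: combine a text-only
# response with image captions into an interleaved response.
# All names and the matching heuristic are hypothetical illustrations.

from dataclasses import dataclass


@dataclass
class RetrievedImage:
    image_id: str   # identifier of an image found in the retrieved documents
    caption: str    # description produced by a vision-language model


def build_interleaved_response(text_response: str,
                               images: list[RetrievedImage]) -> str:
    """Insert image-reference placeholders after sentences whose wording
    overlaps an image caption (a naive stand-in for model-based matching)."""
    sentences = [s.strip() for s in text_response.split(".") if s.strip()]
    used = set()
    parts = []
    for sentence in sentences:
        parts.append(sentence + ".")
        for img in images:
            overlap = set(sentence.lower().split()) & set(img.caption.lower().split())
            if img.image_id not in used and len(overlap) >= 3:
                parts.append(f"<image:{img.image_id}>")  # placeholder reference token
                used.add(img.image_id)
    return " ".join(parts)


if __name__ == "__main__":
    imgs = [RetrievedImage("img_01", "A Brachiosaurus with a long neck reaching tree tops")]
    answer = "The Brachiosaurus had a very long neck reaching high tree tops. It ate plants."
    print(build_interleaved_response(answer, imgs))

In the paper the matching and interleaving are performed by language and vision-language models rather than keyword overlap; this sketch only fixes the general shape of an interleaved training sample.
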
ImageRef-VL: Enabling Contextual Image Referencing in Vision-Language Models
  • Preprint
  • File available

January 2025 · 8 Reads

Jingwei Yi · Junhao Yin · Ju Xu · [...] · Hao Wang

Vision-Language Models (VLMs) have demonstrated remarkable capabilities in understanding multimodal inputs and have been widely integrated into Retrieval-Augmented Generation (RAG) based conversational systems. While current VLM-powered chatbots can provide textual source references in their responses, they exhibit significant limitations in referencing contextually relevant images during conversations. In this paper, we introduce Contextual Image Reference -- the ability to appropriately reference relevant images from retrieval documents based on conversation context -- and systematically investigate VLMs' capability in this aspect. We establish the first evaluation framework for contextual image referencing, comprising a dedicated testing dataset and evaluation metrics. Furthermore, we propose ImageRef-VL, a method that significantly enhances open-source VLMs' image referencing capabilities through instruction fine-tuning on a large-scale, manually curated multimodal conversation dataset. Experimental results demonstrate that ImageRef-VL not only outperforms proprietary models but also achieves an 88% performance improvement over state-of-the-art open-source VLMs in contextual image referencing tasks. Our code is available at https://github.com/bytedance/ImageRef-VL.
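
The abstract and Figure 2 describe Stage 2 as supervised fine-tuning of a vision encoder, adapter, and language model with a generative loss on the curated interleaved conversations. A standard way to realize such a loss is next-token cross-entropy with prompt and image-placeholder positions masked out; the PyTorch sketch below shows that generic formulation under this assumption and is not drawn from the ImageRef-VL codebase.

# Generic supervised fine-tuning ("generative") loss: next-token
# cross-entropy over the interleaved response, with prompt/image
# positions masked. Tensor names are illustrative.

import torch
import torch.nn.functional as F


def generative_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """
    logits: (batch, seq_len, vocab_size) language-model output, conditioned
            on the vision encoder + adapter features.
    labels: (batch, seq_len) token ids; prompt and image-placeholder
            positions are set to -100 so they are ignored.
    """
    # Shift so that each position predicts the next token.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,
    )


if __name__ == "__main__":
    vocab, seq = 32000, 16
    logits = torch.randn(2, seq, vocab)
    labels = torch.randint(0, vocab, (2, seq))
    labels[:, :8] = -100  # mask the prompt half of the sequence
    print(generative_loss(logits, labels).item())

Masking with -100 relies on PyTorch's ignore_index convention, so only tokens belonging to the interleaved response contribute to the loss.
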
