Xirong Li

Renmin University of China | RUC · Key Lab of Data Engineering and Knowledge Engineering

PhD

About

203 Publications · 42,027 Reads · 6,264 Citations
Introduction
I am interested in image/video search. I received the Ph.D. degree from the University of Amsterdam (2012), and the Master's and Bachelor's degrees from Tsinghua University (2007 and 2005), all in computer science. I received the SIGMM Best PhD Thesis Award 2013, the IEEE Transactions on Multimedia Prize Paper Award 2012, the Best Paper Award of ACM CIVR 2010, the Best Paper runner-up awards of ACM ICMR 2012 and PCM 2014, and the outstanding reviewer award of PCM 2014.
Additional affiliations
  • August 2022 - present: Renmin University of China, Professor (Full)
  • August 2016 - August 2022: Renmin University of China, Professor (Associate)
  • May 2012 - July 2016: Renmin University of China, Professor (Assistant)

Publications (203)
Article
Full-text available
Social image analysis and retrieval is important for helping people organize and access the increasing amount of user tagged multimedia. Since user tagging is known to be uncontrolled, ambiguous, and overly personalized, a fundamental problem is how to interpret the relevance of a user-contributed tag with respect to the visual content the tag is d...
Article
Full-text available
This paper attacks the challenging problem of video retrieval by text. In such a retrieval paradigm, an end user searches for unlabeled videos by ad-hoc queries described exclusively in the form of a natural-language sentence, with no visual example provided. Given videos as sequences of frames and queries as sequences of words, an effective sequen...
Conference Paper
Full-text available
This paper attacks an emerging challenge of multi-modal retinal disease recognition. Given a multi-modal case consisting of a color fundus photo (CFP) and an array of OCT B-scan images acquired during an eye examination, we aim to build a deep neural network that recognizes multiple vision-threatening diseases for the given case. As the diagnostic...
Article
Full-text available
As manipulating images by copy-move, splicing and/or inpainting may lead to misinterpretation of the visual content, detecting these sorts of manipulations is crucial for media forensics. Given the variety of possible attacks on the content, devising a generic method is nontrivial. Current deep learning based methods are promising when training and...
Conference Paper
Full-text available
Ad-hoc video search (AVS) is an important yet challenging problem in multimedia retrieval. Different from previous concept-based methods, we propose a fully deep learning method for query representation learning. The proposed method requires no explicit concept modeling, matching and selection. The backbone of our method is the proposed W2VV++ mode...
Preprint
Previous research on retinal vessel segmentation is targeted at a specific image domain, mostly color fundus photography (CFP). In this paper we make a brave attempt to attack a more challenging task of broad-domain retinal vessel segmentation (BD-RVS), which is to develop a unified model applicable to varied domains including CFP, SLO, UWF, OCTA a...
Preprint
Multimodal Large Language Models (MLLMs) are known to hallucinate, which limits their practical applications. Recent works have attempted to apply Direct Preference Optimization (DPO) to enhance the performance of MLLMs, but have shown inconsistent improvements in mitigating hallucinations. To address this issue more effectively, we introduce Hallu...
Preprint
Video-text retrieval has seen significant advancements, yet the ability of models to discern subtle differences in captions still requires verification. In this paper, we introduce a new approach for fine-grained evaluation. Our approach can be applied to existing datasets by automatically generating hard negative test captions with subtle single-w...
Preprint
Hallucinations in multimodal large language models (MLLMs) hinder their practical applications. To address this, we propose a Magnifier Prompt (MagPrompt), a simple yet effective method to tackle hallucinations in MLLMs via extremely simple instructions. MagPrompt is based on the following two key principles, which guide the design of various effec...
Preprint
Adding proper background music helps complete a short video to be shared. Towards automating the task, previous research focuses on video-to-music retrieval (VMR), aiming to find amidst a collection of music the one best matching the content of a given video. Since music tracks are typically much longer than short videos, meaning the returned music...
Preprint
Videos showcasing specific products are increasingly important for E-commerce. Key moments naturally exist as the first appearance of a specific product, presentation of its distinctive features, the presence of a buying link, etc. Adding proper sound effects (SFX) to these key moments, or video decoration with SFX (VDSFX), is crucial for enhancing...
Preprint
Full-text available
E-commerce is increasingly multimedia-enriched, with products exhibited in a broad-domain manner as images, short videos, or live stream promotions. A unified and vectorized cross-domain product representation is essential. Due to large intra-product variance and high inter-product similarity in the broad-domain scenario, a visual-only represent...
Conference Paper
Full-text available
For text-to-video retrieval (T2VR), which aims to retrieve unlabeled videos by ad-hoc textual queries, CLIP-based methods currently lead the way. Compared to CLIP4Clip which is efficient and compact, state-of-the-art models tend to compute video-text similarity through fine-grained cross-modal feature interaction and matching, putting their scalabi...
Article
Full-text available
Artificial intelligence (AI)-based diagnostic systems have been reported to improve fundus disease screening in previous studies. This multicenter prospective self-controlled clinical trial aims to evaluate the diagnostic performance of a deep learning system (DLS) in assisting junior ophthalmologists in detecting 13 major fundus diseases. A total...
Conference Paper
Full-text available
We summarize our TRECVID 2023 Ad-hoc Video Search (AVS) experiments. We focus on leveraging pre-trained multimodal models for video and text representation. For video feature extraction, we utilized pre-trained models including BLIP, CLIP, irCSN, BEiT, WSL, Video-LLaMA, and BLIP-2. For text features, we employed BLIP, CLIP, and BLIP-2. Our approach...
Conference Paper
Full-text available
This paper introduces ChinaOpen, a dataset sourced from Bilibili, a popular Chinese video-sharing website, for open-world multi-modal learning. While the state-of-the-art multimodal learning networks have shown impressive performance in automated video annotation and cross-modal video retrieval, their training and evaluation are primarily conducted...
Conference Paper
Full-text available
This paper addresses the emerging task of recognizing multiple retinal diseases from wide-field (WF) and ultra-wide-field (UWF) fundus images. For an effective use of existing large amount of labeled color fundus photo (CFP) data and the relatively small amount of WF and UWF data, we propose a supervised domain adaptation method named Cross-domain...
Conference Paper
Full-text available
For homography estimation, we propose Geometrized Transformer (GeoFormer), a new detector-free feature matching method. Current detector-free methods, e.g. LoFTR, lack an effective means to accurately localize small and thus computationally feasible regions for cross-attention diffusion. We resolve the challenge with an extremely simple idea: using...
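For context, a homography is a 3x3 projective transform that maps points between two views of a planar scene; estimating it from matched keypoints is the task the abstract above addresses. The following is a minimal illustrative sketch of applying a known homography to a point, not the GeoFormer matching pipeline; all names are hypothetical:

```python
def apply_homography(H, pt):
    """Map a 2-D point through a 3x3 homography (projective transform).

    The point (x, y) is lifted to homogeneous coordinates (x, y, 1),
    multiplied by H, then divided by the last coordinate.
    """
    x, y = pt
    xh = H[0][0] * x + H[0][1] * y + H[0][2]
    yh = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return (xh / w, yh / w)

# Identity homography: points are unchanged.
I = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# A pure translation by (5, -2), expressed as a homography.
T = [[1.0, 0.0, 5.0], [0.0, 1.0, -2.0], [0.0, 0.0, 1.0]]
```

In practice the matrix is unknown and is estimated from keypoint correspondences (e.g. with RANSAC); the quality of those correspondences is exactly what detector-free matchers like GeoFormer aim to improve.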
Preprint
The most common type of lung cancer, lung adenocarcinoma (LUAD), has been increasingly detected since the advent of low-dose computed tomography screening technology. In clinical practice, pre-invasive LUAD (Pre-IAs) should only require regular follow-up care, while invasive LUAD (IAs) should receive immediate treatment with appropriate lung cancer...
Preprint
For text-to-video retrieval (T2VR), which aims to retrieve unlabeled videos by ad-hoc textual queries, CLIP-based methods are dominating. Compared to CLIP4Clip which is efficient and compact, the state-of-the-art models tend to compute video-text similarity by fine-grained cross-modal feature interaction and matching, putting their scalability for...
Preprint
This paper addresses the emerging task of recognizing multiple retinal diseases from wide-field (WF) and ultra-wide-field (UWF) fundus images. For an effective reuse of existing labeled color fundus photo (CFP) data, we propose Cross-domain Collaborative Learning (CdCL). Inspired by the success of fixed-ratio based mixup in unsupervised domain adap...
Preprint
This paper introduces ChinaOpen, a dataset sourced from Bilibili, a popular Chinese video-sharing website, for open-world multimodal learning. While the state-of-the-art multimodal learning networks have shown impressive performance in automated video annotation and cross-modal video retrieval, their training and evaluation have primarily been cond...
Preprint
We summarize our TRECVID 2022 Ad-hoc Video Search (AVS) experiments. Our solution is built with two new techniques, namely Lightweight Attentional Feature Fusion (LAFF) for combining diverse visual / textual features and Bidirectional Negation Learning (BNL) for addressing queries that contain negation cues. In particular, LAFF performs feature fus...
Article
Full-text available
Biased cues in the training data may build strong connections between specific targets and unexpected concepts, so that the learned representations cannot be applied to real-world data that does not contain the same biased cues. To learn cross-bias representations that generalize to unbiased datasets using only biased data, researche...
Chapter
For retinal image matching (RIM), we propose SuperRetina, the first end-to-end method with jointly trainable keypoint detector and descriptor. SuperRetina is trained in a novel semi-supervised manner. A small set of (nearly 100) images are incompletely labeled and used to supervise the network to detect keypoints on the vascular tree. To attack the...
Chapter
In this paper we revisit feature fusion, an old-fashioned topic, in the new context of text-to-video retrieval. Different from previous research that considers feature fusion only at one end, be it video or text, we aim for feature fusion for both ends within a unified framework. We hypothesize that optimizing the convex combination of the feat...
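The convex-combination fusion idea mentioned above can be sketched in a few lines: weights produced by a softmax are non-negative and sum to one, so the fused vector is a convex combination of the input features. This is an illustrative toy example under those assumptions, not the authors' implementation; the names (`convex_fusion`, `logits`) are hypothetical:

```python
import math

def convex_fusion(features, logits):
    """Fuse feature vectors from several extractors by a convex combination.

    features: list of equal-length feature vectors (one per feature type).
    logits: unnormalized weights (learned in the paper's setting; fixed here).
    Returns sum_i w_i * f_i with w = softmax(logits), so each w_i >= 0
    and the weights sum to 1, i.e. a convex combination.
    """
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]  # numerically stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(features[0])
    return [sum(w * f[i] for w, f in zip(weights, features))
            for i in range(dim)]

# Two toy 4-d features with equal logits: the fused vector is their mean.
fused = convex_fusion([[1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0]], [0.0, 0.0])
# fused == [0.5, 0.5, 0.5, 0.5]
```

In a trained model the logits would be produced by an attention module conditioned on the inputs, so the mixing weights adapt per sample.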
Chapter
Corneal nerve fiber medical indicators are promising metrics for diagnosis of diabetic peripheral neuropathy. However, automatic nerve segmentation still faces the issues of insufficient data and expensive annotations. We propose a semi-supervised learning framework for CCM image segmentation. It includes self-supervised pre-training, supervised fi...
Chapter
Fundus photography is an important basis for ophthalmologists to diagnose retinal diseases. Due to limitations in the optical system design of portable fundus cameras, there still exist typical image defects leading to low-quality images. There are stray light defects such as atomization area, shadow ring, bright spot, central dark hole and so...
Chapter
When a typical wide field fundus camera takes two fundus images as a pair, two patterns of illumination beams (the top and bottom, left and right) are turned on respectively. Due to the influence of the illumination beams, the reflected and scattered light haze are observed in neighboring regions, which results in partial or even complete occlusion...
Preprint
Current methods for text-to-video retrieval (T2VR) are trained and tested on video-captioning oriented datasets such as MSVD, MSR-VTT and VATEX. A key property of these datasets is that videos are assumed to be temporally pre-trimmed with short duration, whilst the provided captions well describe the gist of the video content. Consequently, for a g...
Preprint
With a good code search engine, developers can reuse existing code snippets and accelerate software development process. Current code search methods can be divided into two categories: traditional information retrieval (IR) based and deep learning (DL) based approaches. DL-based approaches include the cross-encoder paradigm and the bi-encoder parad...
Preprint
Thanks to the Transformer-based pretraining models, the performance of code search has been improved significantly. However, due to the restriction of multi-head self-attention and GPU memory, there is a limit on the input token length. The existing pretrained code models, such as GraphCodeBERT, CodeBERT, RoBERTa (code), take the first 256 tokens b...
Article
This paper aims for the task of text-to-video retrieval, where given a query in the form of a natural-language sentence, it is asked to retrieve videos which are semantically relevant to the given query, from a great number of unlabeled videos. The success of this task depends on cross-modal representation learning that projects both videos and sen...
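At a high level, the cross-modal representation learning described above projects both modalities into a shared space and ranks videos by their similarity to the query. Below is a minimal sketch assuming fixed random linear projections in place of learned encoders; all names are hypothetical and this is not the paper's model:

```python
import math
import random

random.seed(0)

def make_projection(in_dim, out_dim):
    """A stand-in for a learned encoder: a fixed random linear projection."""
    return [[random.gauss(0.0, 1.0 / math.sqrt(in_dim))
             for _ in range(out_dim)] for _ in range(in_dim)]

def encode(x, W):
    """Project x into the common space and L2-normalize the result."""
    z = [sum(xi * W[i][j] for i, xi in enumerate(x))
         for j in range(len(W[0]))]
    norm = math.sqrt(sum(v * v for v in z))
    return [v / norm for v in z]

video_proj = make_projection(8, 4)  # video-feature dim -> common space
text_proj = make_projection(6, 4)   # sentence-feature dim -> common space

def similarity(video_feat, text_feat):
    """Cosine similarity in the common space (both vectors unit-length)."""
    v = encode(video_feat, video_proj)
    t = encode(text_feat, text_proj)
    return sum(a * b for a, b in zip(v, t))

sim = similarity([1.0] * 8, [1.0] * 6)  # a value in [-1, 1]
```

Retrieval then amounts to encoding every video once, encoding the query sentence, and returning the videos with the highest similarity scores.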
Conference Paper
Full-text available
For retinal image matching (RIM), we propose SuperRetina, the first end-to-end method with jointly trainable keypoint detector and descriptor. SuperRetina is trained in a novel semi-supervised manner. A small set of (nearly 100) images are incompletely labeled and used to supervise the network to detect keypoints on the vascular tree. To attack the...
Preprint
For retinal image matching (RIM), we propose SuperRetina, the first end-to-end method with jointly trainable keypoint detector and descriptor. SuperRetina is trained in a novel semi-supervised manner. A small set of (nearly 100) images are incompletely labeled and used to supervise the network to detect keypoints on the vascular tree. To attack the...
Article
Full-text available
The daily practice of sharing images on social media raises a severe issue of privacy leakage. To address the issue, privacy-leaking image detection has been studied recently, with the goal of automatically identifying images that may leak privacy. Recent advances on this task benefit from focusing on crucial objects via pretrained object detectors and m...
Article
With the rapid progress of generation technology, it has become necessary to attribute the origin of fake images. Existing works on fake image attribution perform multi-class classification on several Generative Adversarial Network (GAN) models and obtain high accuracies. While encouraging, these works are restricted to model-level attribution, onl...
Article
Full-text available
Purpose: To develop deep learning models based on color fundus photographs that can automatically grade myopic maculopathy, diagnose pathologic myopia, and identify and segment myopia-related lesions. Methods: Photographs were graded and annotated by four ophthalmologists and were then divided into a high-consistency subgroup or a low-consistenc...
Article
Full-text available
Autonomous driving is regarded as one of the most promising remedies to shield human beings from severe crashes. To this end, 3D object detection serves as the core basis of the perception stack, especially for path planning, motion prediction, and collision avoidance. Taking a quick glance at the progress we have made, we attribute chal...
Preprint
Full-text available
Negation is a common linguistic skill that allows human to express what we do NOT want. Naturally, one might expect video retrieval to support natural-language queries with negation, e.g., finding shots of kids sitting on the floor and not playing with the dog. However, the state-of-the-art deep learning based video retrieval models lack such abili...
Article
Full-text available
This paper tackles automated categorization of Age-related Macular Degeneration (AMD), a common macular disease among people over 50. Previous research efforts mainly focus on AMD categorization with a single-modal input, be it a color fundus photograph (CFP) or an OCT B-scan image. By contrast, we consider AMD categorization given a multi-moda...
Conference Paper
Full-text available
Over 300 million people worldwide are affected by various retinal diseases. By noninvasive Optical Coherence Tomography (OCT) scans, a number of abnormal structural changes in the retina, namely retinal lesions, can be identified. Automated lesion localization in OCT is thus important for detecting retinal diseases at their early stage. To conquer...
Preprint
Full-text available
Over 300 million people worldwide are affected by various retinal diseases. By noninvasive Optical Coherence Tomography (OCT) scans, a number of abnormal structural changes in the retina, namely retinal lesions, can be identified. Automated lesion localization in OCT is thus important for detecting retinal diseases at their early stage. To conquer...
Preprint
Unsupervised Domain Adaptation (UDA) is known to trade a model's performance on a source domain for improving its performance on a target domain. To resolve the issue, Unsupervised Domain Expansion (UDE) has been proposed recently to adapt the model for the target domain as UDA does, and in the meantime maintain its performance on the source domain...
Preprint
Full-text available
The daily practice of sharing images on social media raises a severe issue of privacy leakage. To address the issue, privacy-leaking image detection has been studied recently, with the goal of automatically identifying images that may leak privacy. Recent advances on this task benefit from focusing on crucial objects via pretrained object detectors and m...
Article
Full-text available
Purpose: The purpose of this study is to develop and validate the intelligent diagnosis of severe DR with lesion recognition based on color fundus photography. Methods: The Kaggle public dataset for DR grading is used in the project, including 53,576 fundus photos in the test set, 28,101 in the training set, and 7,025 in the validation set. We randomly...
Preprint
With the rapid progress of generation technology, it has become necessary to attribute the origin of fake images. Existing works on fake image attribution perform multi-class classification on several Generative Adversarial Network (GAN) models and obtain high accuracies. While encouraging, these works are restricted to model-level attribution, onl...
Conference Paper
Full-text available
In this paper, we summarize our TRECVID 2021 experiments. We participated in two tasks: Ad-hoc Video Search (AVS) and Video-to-Text Description Generation (VTT). For the AVS task, we develop our solutions based on two cross-modal matching models, i.e. Sentence Encoder Assembly (SEA) and Multiple Encoder Assembly (MEA). MEA is a variant of SEA that...
Preprint
When a retrieval system expands data, its database is at risk of being attacked. In this paper, we introduce the concept of targeted Trojan-horse (TTH) attacks for language-based image retrieval (LBIR), the first keyword-wise targeted attack against the database of the retrieval system. Specifically, given a specific keyword, TTH generates a QR-cod...
Preprint
Full-text available
This paper aims for the task of text-to-video retrieval, where given a query in the form of a natural-language sentence, it is asked to retrieve videos which are semantically relevant to the given query, from a great number of unlabeled videos. The success of this task depends on cross-modal representation learning that projects both videos and sen...
Article
Full-text available
Currently, state-of-the-art 3D object detectors follow a two-stage paradigm. These methods typically comprise two steps: 1) Utilize a region proposal network to propose a handful of high-quality proposals in a bottom-up fashion. 2) Resize and pool the semantic features from the proposed regions to summarize RoI-wise representations for furth...
Preprint
Full-text available
False claims that have been previously fact-checked can still spread on social media. To mitigate their continual spread, detecting previously fact-checked claims is indispensable. Given a claim, existing works focus on providing evidence for detection by reranking candidate fact-checking articles (FC-articles) retrieved by BM25. However, these per...
Preprint
The key research question for image manipulation detection is how to learn generalizable features that are sensitive to manipulations in novel data, whilst specific to prevent false alarms on authentic images. Current research emphasizes the sensitivity, with the specificity mostly ignored. In this paper we address both aspects by multi-view featur...
Preprint
In this paper, we revisit feature fusion, an old-fashioned topic, in the new context of video retrieval by text. Different from previous research that considers feature fusion only at one end, be it video or text, we aim for feature fusion for both ends within a unified framework. We hypothesize that optimizing the convex combination of...
Conference Paper
Full-text available
For quantifying progress in Ad-hoc Video Search (AVS), the annual TRECVID AVS task is an important international evaluation. Solutions submitted by the task participants vary in terms of their choices of cross-modal matching models, visual features and training data. As such, what one may conclude from the evaluation is at a high level that is insu...
Conference Paper
Full-text available
The key challenge of image manipulation detection is how to learn generalizable features that are sensitive to manipulations in novel data, whilst specific to prevent false alarms on authentic images. Current research emphasizes the sensitivity, with the specificity overlooked. In this paper we address both aspects by multi-view feature learning an...
Preprint
Full-text available
This paper attacks an emerging challenge of multi-modal retinal disease recognition. Given a multi-modal case consisting of a color fundus photo (CFP) and an array of OCT B-scan images acquired during an eye examination, we aim to build a deep neural network that recognizes multiple vision-threatening diseases for the given case. As the diagnostic...
Preprint
Full-text available
For quantifying progress in Ad-hoc Video Search (AVS), the annual TRECVID AVS task is an important international evaluation. Solutions submitted by the task participants vary in terms of their choices of cross-modal matching models, visual features and training data. As such, what one may conclude from the evaluation is at a high level that is insu...
Preprint
Recently, fake news with text and images have achieved more effective diffusion than text-only fake news, raising a severe issue of multimodal fake news detection. Current studies on this issue have made significant contributions to developing multimodal models, but they are defective in modeling the multimodal content sufficiently. Most of them on...
Conference Paper
Full-text available
This paper describes our bronze-medal solution for the video cap-tioning task of the ACMMM2021 Pre-Training for Video Understanding Challenge. We depart from the Bottom-Up-Top-Down model, with technical improvements on both video content encoding and caption decoding. For encoding, we propose to extract multi-level video features that describe holi...
Article
Full-text available
Purpose: To investigate the detection of lattice degeneration, retinal breaks, and retinal detachment in tessellated eyes using an ultra-wide-field fundus imaging system (Optos) with convolutional neural network technology. Methods: This study included 1500 Optos color images for tessellated fundus confirmation and peripheral retinal lesion (lattice degen...
Preprint
Full-text available
Autonomous driving is regarded as one of the most promising remedies to shield human beings from severe crashes. To this end, 3D object detection serves as the core basis of the perception stack, especially for path planning, motion prediction, and collision avoidance. Taking a quick glance at the progress we have made, we attribute chal...
Preprint
Full-text available
Rapid pace of generative models has brought about new threats to visual forensics such as malicious personation and digital copyright infringement, which promotes works on fake image attribution. Existing works on fake image attribution mainly rely on a direct classification framework. Without additional supervision, the extracted features could in...
Article
The cover image is based on the Research Article Deep learning‐based detection and stage grading for optimising diagnosis of diabetic retinopathy by Yuelin Wang et al., https://doi.org/10.1002/dmrr.3445.
Preprint
Full-text available
Currently, state-of-the-art 3D object detectors follow a two-stage paradigm. These methods typically comprise two steps: 1) Utilize a region proposal network to propose a fraction of high-quality proposals in a bottom-up fashion. 2) Resize and pool the semantic features from the proposed regions to summarize RoI-wise representations for furthe...
Preprint
Full-text available
The key challenge of image manipulation detection is how to learn generalizable features that are sensitive to manipulations in novel data, whilst specific to prevent false alarms on authentic images. Current research emphasizes the sensitivity, with the specificity overlooked. In this paper we address both aspects by multi-view feature learning an...
Preprint
Full-text available
Expanding visual categorization into a novel domain without the need for extra annotation has been a long-term interest for multimedia intelligence. Previously, this challenge has been approached by unsupervised domain adaptation (UDA). Given labeled data from a source domain and unlabeled data from a target domain, UDA seeks a deep representati...
Article
Aim To explore and evaluate an appropriate deep learning system (DLS) for the detection of 12 major fundus diseases using colour fundus photography. Methods Diagnostic performance of a DLS was tested on the detection of normal fundus and 12 major fundus diseases including referable diabetic retinopathy, pathologic myopic retinal degeneration, reti...