About
203
Publications
42,027
Reads
How we measure 'reads'
A 'read' is counted each time someone views a publication summary (such as the title, abstract, and list of authors), clicks on a figure, or views or downloads the full-text. Learn more
6,264
Citations
Introduction
I am interested in image/video search. I received the Ph.D. degree from the University of Amsterdam (2012), the Master and Bachelor degrees from the Tsinghua University (2007 and 2005), all in computer science. I received SIGMM Best PhD Thesis Award 2013, IEEE Transactions on Multimedia Prize Paper Award 2012, Best Paper Award of ACM CIVR 2010, Best Paper runner-up of ACM ICMR 2012 and PCM 2014, and outstanding reviewer award of PCM 2014.
Skills and Expertise
Additional affiliations
August 2022 - present
August 2016 - August 2022
May 2012 - July 2016
Publications
Publications (203)
Social image analysis and retrieval is important for helping people organize and access the increasing amount of user tagged multimedia. Since user tagging is known to be uncontrolled, ambiguous, and overly personalized, a fundamental problem is how to interpret the relevance of a user-contributed tag with respect to the visual content the tag is d...
This paper attacks the challenging problem of video retrieval by text. In such a retrieval paradigm, an end user searches for unlabeled videos by ad-hoc queries described exclusively in the form of a natural-language sentence, with no visual example provided. Given videos as sequences of frames and queries as sequences of words, an effective sequen...
This paper attacks an emerging challenge of multi-modal retinal disease recognition. Given a multi-modal case consisting of a color fundus photo (CFP) and an array of OCT B-scan images acquired during an eye examination, we aim to build a deep neural network that recognizes multiple vision-threatening diseases for the given case. As the diagnostic...
As manipulating images by copy-move, splicing and/or inpainting may lead to misinterpretation of the visual content, detecting these sorts of manipulations is crucial for media forensics. Given the variety of possible attacks on the content, devising a generic method is nontrivial. Current deep learning based methods are promising when training and...
Ad-hoc video search (AVS) is an important yet challenging problem in multimedia retrieval. Different from previous concept-based methods, we propose a fully deep learning method for query representation learning. The proposed method requires no explicit concept modeling, matching and selection. The backbone of our method is the proposed W2VV++ mode...
Previous research on retinal vessel segmentation is targeted at a specific image domain, mostly color fundus photography (CFP). In this paper we make a brave attempt to attack a more challenging task of broad-domain retinal vessel segmentation (BD-RVS), which is to develop a unified model applicable to varied domains including CFP, SLO, UWF, OCTA a...
Multimodal Large Language Models (MLLMs) are known to hallucinate, which limits their practical applications. Recent works have attempted to apply Direct Preference Optimization (DPO) to enhance the performance of MLLMs, but have shown inconsistent improvements in mitigating hallucinations. To address this issue more effectively, we introduce Hallu...
Video-text retrieval has seen significant advancements, yet the ability of models to discern subtle differences in captions still requires verification. In this paper, we introduce a new approach for fine-grained evaluation. Our approach can be applied to existing datasets by automatically generating hard negative test captions with subtle single-w...
Hallucinations in multimodal large language models (MLLMs) hinder their practical applications. To address this, we propose a Magnifier Prompt (MagPrompt), a simple yet effective method to tackle hallucinations in MLLMs via extremely simple instructions. MagPrompt is based on the following two key principles, which guide the design of various effec...
Adding proper background music helps complete a short video to be shared. Towards automating the task, previous research focuses on video-to-music retrieval (VMR), aiming to find amidst a collection of music the one best matching the content of a given video. Since music tracks are typically much longer than short videos, meaning the returned music...
Videos showcasing specific products are increasingly important for E-commerce. Key moments naturally exist as the first appearance of a specific product, presentation of its distinctive features, the presence of a buying link, etc. Adding proper sound effects (SFX) to these key moments, or video decoration with SFX (VDSFX), is crucial for enhancing...
E-commerce is increasingly multimedia-enriched, with products exhibited in a broad-domain manner as images, short videos, or live stream promotions. A unified and vectorized cross-domain production representation is essential. Due to large intra-product variance and high inter-product similarity in the broad-domain scenario, a visual-only represent...
For text-to-video retrieval (T2VR), which aims to retrieve unlabeled videos by ad-hoc textual queries, CLIP-based methods currently lead the way. Compared to CLIP4Clip which is efficient and compact, state-of-the-art models tend to compute video-text similarity through fine-grained cross-modal feature interaction and matching, putting their scalabi...
Artificial intelligence (AI)-based diagnostic systems have been reported to improve fundus disease screening in previous studies. This multicenter prospective self-controlled clinical trial aims to evaluate the diagnostic performance of a deep learning system (DLS) in assisting junior ophthalmologists in detecting 13 major fundus diseases. A total...
We summarize our TRECVID 2023 Ad-hoc Video Search (AVS) experiments. We focus on leveraging pre-trained multimodal models for video and text representation. For video feature extraction, we utilized pre-trained models including BLIP, CLIP, irCSN, BEiT, WSL, Video-LLaMA, and BLIP-2. For text features, we employed BLIP, CLIP, and BLIP-2. Our approach...
This paper introduces ChinaOpen, a dataset sourced from Bilibili, a popular Chinese video-sharing website, for open-world multi-modal learning. While the state-of-the-art multimodal learning networks have shown impressive performance in automated video annotation and cross-modal video retrieval, their training and evaluation are primarily conducted...
This paper addresses the emerging task of recognizing multiple retinal diseases from wide-field (WF) and ultra-wide-field (UWF) fundus images. For an effective use of existing large amount of labeled color fundus photo (CFP) data and the relatively small amount of WF and UWF data, we propose a supervised domain adaptation method named Cross-domain...
For homography estimation, we propose Geometrized Transformer (GeoFormer), a new detector-free feature matching method. Current detector-free methods, e.g. LoFTR, lack an effective mean to accurately localize small and thus computationally feasible regions for cross-attention diffusion. We resolve the challenge with an extremely simple idea: using...
The most common type of lung cancer, lung adenocarcinoma (LUAD), has been increasingly detected since the advent of low-dose computed tomography screening technology. In clinical practice, pre-invasive LUAD (Pre-IAs) should only require regular follow-up care, while invasive LUAD (IAs) should receive immediate treatment with appropriate lung cancer...
For text-to-video retrieval (T2VR), which aims to retrieve unlabeled videos by ad-hoc textual queries, CLIP-based methods are dominating. Compared to CLIP4Clip which is efficient and compact, the state-of-the-art models tend to compute video-text similarity by fine-grained cross-modal feature interaction and matching, putting their scalability for...
This paper addresses the emerging task of recognizing multiple retinal diseases from wide-field (WF) and ultra-wide-field (UWF) fundus images. For an effective reuse of existing labeled color fundus photo (CFP) data, we propose Cross-domain Collaborative Learning (CdCL). Inspired by the success of fixed-ratio based mixup in unsupervised domain adap...
This paper introduces ChinaOpen, a dataset sourced from Bilibili, a popular Chinese video-sharing website, for open-world multimodal learning. While the state-of-the-art multimodal learning networks have shown impressive performance in automated video annotation and cross-modal video retrieval, their training and evaluation have primarily been cond...
We summarize our TRECVID 2022 Ad-hoc Video Search (AVS) experiments. Our solution is built with two new techniques, namely Lightweight Attentional Feature Fusion (LAFF) for combining diverse visual / textual features and Bidirectional Negation Learning (BNL) for addressing queries that contain negation cues. In particular, LAFF performs feature fus...
The biased cues in the training data may build strong connections between specific targets and unexpected concepts, leading the learned representations could not be applied to real-world data that does not contain the same biased cues. To learn cross-bias representations which can generalize on unbiased datasets by only using biased data, researche...
For retinal image matching (RIM), we propose SuperRetina, the first end-to-end method with jointly trainable keypoint detector and descriptor. SuperRetina is trained in a novel semi-supervised manner. A small set of (nearly 100) images are incompletely labeled and used to supervise the network to detect keypoints on the vascular tree. To attack the...
In this paper we revisit feature fusion, an old-fashioned topic, in the new context of text-to-video retrieval. Different from previous research that considers feature fusion only at one end, let it be video or text, we aim for feature fusion for both ends within a unified framework. We hypothesize that optimizing the convex combination of the feat...
Corneal nerve fiber medical indicators are promising metrics for diagnosis of diabetic peripheral neuropathy. However, automatic nerve segmentation still faces the issues of insufficient data and expensive annotations. We propose a semi-supervised learning framework for CCM image segmentation. It includes self-supervised pre-training, supervised fi...
Fundus photograph is an important basis for ophthalmologists to diagnose retinal diseases. Due to the limitations of the optical system design for portable fundus cameras, there still exist typical image defects leading to low quality images. There are stray light defects such as atomization area, shadow ring, bright spot, central dark hole and so...
When a typical wide field fundus camera takes two fundus images as a pair, two patterns of illumination beams (the top and bottom, left and right) are turned on respectively. Due to the influence of the illumination beams, the reflected and scattered light haze are observed in neighboring regions, which results in partial or even complete occlusion...
Current methods for text-to-video retrieval (T2VR) are trained and tested on video-captioning oriented datasets such as MSVD, MSR-VTT and VATEX. A key property of these datasets is that videos are assumed to be temporally pre-trimmed with short duration, whilst the provided captions well describe the gist of the video content. Consequently, for a g...
With a good code search engine, developers can reuse existing code snippets and accelerate software development process. Current code search methods can be divided into two categories: traditional information retrieval (IR) based and deep learning (DL) based approaches. DL-based approaches include the cross-encoder paradigm and the bi-encoder parad...
Thanks to the Transformer-based pretraining models, the performance of code search has been improved significantly. However, due to the restriction of multi-head self-attention and GPU memory, there is a limit on the input token length. The existing pretrained code models, such as GraphCodeBERT, CodeBERT, RoBERTa (code), take the first 256 tokens b...
This paper aims for the task of text-to-video retrieval, where given a query in the form of a natural-language sentence, it is asked to retrieve videos which are semantically relevant to the given query, from a great number of unlabeled videos. The success of this task depends on cross-modal representation learning that projects both videos and sen...
For retinal image matching (RIM), we propose SuperRetina, the first end-to-end method with jointly trainable keypoint detector and descriptor. SuperRetina is trained in a novel semi-supervised manner. A small set of (nearly 100) images are incompletely labeled and used to supervise the network to detect keypoints on the vascular tree. To attack the...
For retinal image matching (RIM), we propose SuperRetina, the first end-to-end method with jointly trainable keypoint detector and descriptor. SuperRetina is trained in a novel semi-supervised manner. A small set of (nearly 100) images are incompletely labeled and used to supervise the network to detect keypoints on the vascular tree. To attack the...
The daily practice of sharing images on social media raises a severe issue about privacy leakage. To address the issue, privacy-leaking image detection is studied recently, with the goal to automatically identify images that may leak privacy. Recent advance on this task benefits from focusing on crucial objects via pretrained object detectors and m...
With the rapid progress of generation technology, it has become necessary to attribute the origin of fake images. Existing works on fake image attribution perform multi-class classification on several Generative Adversarial Network (GAN) models and obtain high accuracies. While encouraging, these works are restricted to model-level attribution, onl...
Purpose:
To develop deep learning models based on color fundus photographs that can automatically grade myopic maculopathy, diagnose pathologic myopia, and identify and segment myopia-related lesions.
Methods:
Photographs were graded and annotated by four ophthalmologists and were then divided into a high-consistency subgroup or a low-consistenc...
Autonomous driving is regarded as one of the most promising remedies to shield human beings from severe crashes. To this end, 3D object detection serves as the core basis of perception stack especially for the sake of path planning, motion prediction, and collision avoidance etc. Taking a quick glance at the progress we have made, we attribute chal...
Negation is a common linguistic skill that allows human to express what we do NOT want. Naturally, one might expect video retrieval to support natural-language queries with negation, e.g., finding shots of kids sitting on the floor and not playing with the dog. However, the state-of-the-art deep learning based video retrieval models lack such abili...
This paper tackles automated categorization of Age-related Macular Degeneration (AMD), a common macular disease among people over 50. Previous research efforts mainly focus on AMD categorization with a single-modal input, let it be a color fundus photograph (CFP) or an OCT B-scan image. By contrast, we consider AMD categorization given a multi-moda...
Over 300 million people worldwide are affected by various retinal diseases. By noninvasive Optical Coherence Tomography (OCT) scans, a number of abnormal structural changes in the retina, namely retinal lesions, can be identified. Automated lesion localization in OCT is thus important for detecting retinal diseases at their early stage. To conquer...
Over 300 million people worldwide are affected by various retinal diseases. By noninvasive Optical Coherence Tomography (OCT) scans, a number of abnormal structural changes in the retina, namely retinal lesions, can be identified. Automated lesion localization in OCT is thus important for detecting retinal diseases at their early stage. To conquer...
Unsupervised Domain Adaptation (UDA) is known to trade a model's performance on a source domain for improving its performance on a target domain. To resolve the issue, Unsupervised Domain Expansion (UDE) has been proposed recently to adapt the model for the target domain as UDA does, and in the meantime maintain its performance on the source domain...
The daily practice of sharing images on social media raises a severe issue about privacy leakage. To address the issue, privacy-leaking image detection is studied recently, with the goal to automatically identify images that may leak privacy. Recent advance on this task benefits from focusing on crucial objects via pretrained object detectors and m...
PurposeThe purpose of this study is to develop and validate the intelligent diagnosis of severe DR with lesion recognition based on color fundus photography.Methods
The Kaggle public dataset for DR grading is used in the project, including 53,576 fundus photos in the test set, 28,101 in the training set, and 7,025 in the validation set. We randomly...
With the rapid progress of generation technology, it has become necessary to attribute the origin of fake images. Existing works on fake image attribution perform multi-class classification on several Generative Adversarial Network (GAN) models and obtain high accuracies. While encouraging, these works are restricted to model-level attribution, onl...
In this paper, we summarize our TRECVID 2021 experiments. We participated in two tasks: Ad-hoc Video Search (AVS) and Video-to-Text Description Generation (VTT). For the AVS task, we develop our solutions based on two cross-modal matching models, i.e. Sentence Encoder Assembly (SEA) and Multiple Encoder Assembly (MEA). MEA is a variant of SEA that...
When a retrieval system expands data, its database is at risk of being attacked. In this paper, we introduce the concept of targeted Trojan-horse (TTH) attacks for language-based image retrieval (LBIR), the first keyword-wise targeted attack against the database of the retrieval system. Specifically, given a specific keyword, TTH generates a QR-cod...
This paper aims for the task of text-to-video retrieval, where given a query in the form of a natural-language sentence, it is asked to retrieve videos which are semantically relevant to the given query, from a great number of unlabeled videos. The success of this task depends on cross-modal representation learning that projects both videos and sen...
Currently, existing state-of-the-art 3D object detectors are in two-stage paradigm. These methods typically comprise two steps: 1) Utilize a region proposal network to propose a handful of high-quality proposals in a bottom-up fashion. 2) Resize and pool the semantic features from the proposed regions to summarize RoI-wise representations for furth...
False claims that have been previously fact-checked can still spread on social media. To mitigate their continual spread, detecting previously fact-checked claims is indispensable. Given a claim, existing works focus on providing evidence for detection by reranking candidate fact-checking articles (FC-articles) retrieved by BM25. However, these per...
The key research question for image manipulation detection is how to learn generalizable features that are sensitive to manipulations in novel data, whilst specific to prevent false alarms on authentic images. Current research emphasizes the sensitivity, with the specificity mostly ignored. In this paper we address both aspects by multi-view featur...
In this paper, we revisit \emph{feature fusion}, an old-fashioned topic, in the new context of video retrieval by text. Different from previous research that considers feature fusion only at one end, let it be video or text, we aim for feature fusion for both ends within a unified framework. We hypothesize that optimizing the convex combination of...
For quantifying progress in Ad-hoc Video Search (AVS), the annual TRECVID AVS task is an important international evaluation. Solutions submitted by the task participants vary in terms of their choices of cross-modal matching models, visual features and training data. As such, what one may conclude from the evaluation is at a high level that is insu...
The key challenge of image manipulation detection is how to learn generalizable features that are sensitive to manipulations in novel data, whilst specific to prevent false alarms on authentic images. Current research emphasizes the sensitivity, with the specificity overlooked. In this paper we address both aspects by multi-view feature learning an...
This paper attacks an emerging challenge of multi-modal retinal disease recognition. Given a multi-modal case consisting of a color fundus photo (CFP) and an array of OCT B-scan images acquired during an eye examination, we aim to build a deep neural network that recognizes multiple vision-threatening diseases for the given case. As the diagnostic...
For quantifying progress in Ad-hoc Video Search (AVS), the annual TRECVID AVS task is an important international evaluation. Solutions submitted by the task participants vary in terms of their choices of cross-modal matching models, visual features and training data. As such, what one may conclude from the evaluation is at a high level that is insu...
Recently, fake news with text and images have achieved more effective diffusion than text-only fake news, raising a severe issue of multimodal fake news detection. Current studies on this issue have made significant contributions to developing multimodal models, but they are defective in modeling the multimodal content sufficiently. Most of them on...
This paper describes our bronze-medal solution for the video cap-tioning task of the ACMMM2021 Pre-Training for Video Understanding Challenge. We depart from the Bottom-Up-Top-Down model, with technical improvements on both video content encoding and caption decoding. For encoding, we propose to extract multi-level video features that describe holi...
PurposeTo investigate the detection of lattice degeneration, retinal breaks, and retinal detachment in tessellated eyes using ultra-wide-field fundus imaging system (Optos) with convolutional neural network technology.Methods
This study included 1500 Optos color images for tessellated fundus confirmation and peripheral retinal lesion (lattice degen...
Autonomous driving is regarded as one of the most promising remedies to shield human beings from severe crashes. To this end, 3D object detection serves as the core basis of perception stack especially for the sake of path planning, motion prediction, and collision avoidance etc. Taking a quick glance at the progress we have made, we attribute chal...
Rapid pace of generative models has brought about new threats to visual forensics such as malicious personation and digital copyright infringement, which promotes works on fake image attribution. Existing works on fake image attribution mainly rely on a direct classification framework. Without additional supervision, the extracted features could in...
The cover image is based on the Research Article Deep learning‐based detection and stage grading for optimising diagnosis of diabetic retinopathy by Yuelin Wang et al., https://doi.org/10.1002/dmrr.3445.
Currently, existing state-of-the-art 3D object detectors are in two-stage paradigm. These methods typically comprise two steps: 1) Utilize region proposal network to propose a fraction of high-quality proposals in a bottom-up fashion. 2) Resize and pool the semantic features from the proposed regions to summarize RoI-wise representations for furthe...
The key challenge of image manipulation detection is how to learn generalizable features that are sensitive to manipulations in novel data, whilst specific to prevent false alarms on authentic images. Current research emphasizes the sensitivity, with the specificity overlooked. In this paper we address both aspects by multi-view feature learning an...
Expanding visual categorization into a novel domain without the need of extra annotation has been a long-term interest for multimedia intelligence. Previously, this challenge has been approached by unsupervised domain adaptation (UDA). Given labeled data from a source domain and unlabeled data from a target domain, UDA seeks for a deep representati...
Aim
To explore and evaluate an appropriate deep learning system (DLS) for the detection of 12 major fundus diseases using colour fundus photography.
Methods
Diagnostic performance of a DLS was tested on the detection of normal fundus and 12 major fundus diseases including referable diabetic retinopathy, pathologic myopic retinal degeneration, reti...