Xilin Chen
Chinese Academy of Sciences | CAS

About

Publications: 561
Reads: 95,489
Citations: 32,222
Additional affiliations
January 2011 - present: Tsinghua University
January 2010 - present: University of Oulu

Publications (561)
Preprint
Full-text available
Human pose plays a crucial role in the digital age. While recent works have achieved impressive progress in understanding and generating human poses, they often support only a single modality of control signals and operate in isolation, limiting their application in real-world scenarios. This paper presents UniPose, a framework employing Large Lang...
Preprint
The primary goal of out-of-distribution (OOD) detection tasks is to identify inputs with semantic shifts, i.e., if samples from novel classes are absent in the in-distribution (ID) dataset used for training, we should reject these OOD samples rather than misclassifying them into existing ID classes. However, we find the current definition of "seman...
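For context on this entry, a minimal sketch of the common maximum-softmax-probability (MSP) baseline for scoring OOD inputs; it is a generic baseline, not the method proposed in this paper, and the threshold below is illustrative.

```python
import torch
import torch.nn.functional as F

def msp_ood_score(logits: torch.Tensor) -> torch.Tensor:
    """Maximum softmax probability: lower score => more likely OOD.

    `logits` has shape (batch, num_id_classes). Generic baseline only,
    not the definition of "semantic shift" discussed in the paper above.
    """
    probs = F.softmax(logits, dim=-1)
    return probs.max(dim=-1).values

# Usage: reject inputs whose score falls below a validation-chosen threshold.
logits = torch.randn(4, 10)           # stand-in for an ID classifier's outputs
is_ood = msp_ood_score(logits) < 0.5  # 0.5 is an illustrative threshold
```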
Preprint
Pre-trained vision-language models (e.g., CLIP) have shown powerful zero-shot transfer capabilities. But they still struggle with domain shifts and typically require labeled data to adapt to downstream tasks, which could be costly. In this work, we aim to leverage unlabeled data that naturally spans multiple domains to enhance the transferability o...
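As background for this entry, a minimal zero-shot classification sketch with a pre-trained CLIP via Hugging Face transformers; the checkpoint name, image path, and prompts are illustrative and unrelated to the paper's adaptation method.

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Zero-shot classification with a pre-trained CLIP checkpoint.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # placeholder image path
prompts = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=-1)  # shape (1, 3)
```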
Preprint
Recently, large-scale diffusion models have made impressive progress in text-to-image (T2I) generation. To further equip these T2I models with fine-grained spatial control, approaches like ControlNet introduce an extra network that learns to follow a condition image. However, for every single condition type, ControlNet requires independent training...
Preprint
Full-text available
Categorization, a core cognitive ability in humans that organizes objects based on common features, is essential to cognitive science as well as computer vision. To evaluate the categorization ability of visual AI models, various proxy tasks on recognition from datasets to open world scenarios have been proposed. Recent development of Large Multimo...
Conference Paper
Full-text available
Facial action unit (AU) recognition is essential for recognizing fine-grained changes in facial expression, while the demand for a large amount of accurately labeled AU data for training purposes has resulted in high labor costs. Nevertheless, massive face images are widely available and inaccurate labels can be easily obtained, especially as large...
Article
Semi-supervised learning (SSL) suffers from severe performance degradation when labeled and unlabeled data come from inconsistent and imbalanced distributions. Nonetheless, there is a lack of theoretical guidance regarding a remedy for this issue. To bridge the gap between theoretical insights and practical solutions, we embark on an analysis of gen...
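For context, a simplified sketch of confidence-thresholded pseudo-labeling, the standard SSL recipe whose behavior degrades under the distribution mismatch this entry analyzes; augmentation is omitted and all names (`model`, `tau`) are placeholders, not the paper's remedy.

```python
import torch
import torch.nn.functional as F

def ssl_step(model, x_labeled, y_labeled, x_unlabeled, tau=0.95):
    """One simplified SSL step: supervised loss + confidence-filtered pseudo-labels."""
    sup_loss = F.cross_entropy(model(x_labeled), y_labeled)

    with torch.no_grad():
        probs = F.softmax(model(x_unlabeled), dim=-1)
        conf, pseudo = probs.max(dim=-1)
        mask = conf >= tau                       # keep only confident predictions

    unsup_loss = F.cross_entropy(model(x_unlabeled), pseudo, reduction="none")
    unsup_loss = (unsup_loss * mask.float()).mean()
    return sup_loss + unsup_loss
```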
Preprint
Full-text available
Traditional semi-supervised learning (SSL) assumes that the feature distributions of labeled and unlabeled data are consistent which rarely holds in realistic scenarios. In this paper, we propose a novel SSL setting, where unlabeled samples are drawn from a mixed distribution that deviates from the feature distribution of labeled samples. Under thi...
Preprint
Although face analysis has achieved remarkable improvements in the past few years, designing a multi-task face analysis model is still challenging. Most face analysis tasks are studied as separate problems and do not benefit from the synergy among related tasks. In this work, we propose a novel task-adaptive multi-task face analysis method named as...
Article
Recognition in open-world scenarios is an important and challenging field, where Vision-Language Pre-training paradigms have greatly impacted the 2D domain. This inspires a growing interest in introducing 2D pre-trained models, such as CLIP, into the 3D domain to enhance the ability of point cloud understanding. Considering the difference between d...
Article
Owing to the development of deep networks and abundant data, automatic face recognition (FR) has quickly reached human-level capacity in the past few years. However, the FR problem is not perfectly solved in cases of large poses and uncontrolled occlusions. In this paper, we propose a novel bypass enhanced representation learning (BERL) method...
Article
Video question answering (VideoQA) is challenging since it requires the model to extract and combine multi-level visual concepts from local objects to global actions from complex events for compositional reasoning. Existing works represent the video with fixed-duration clip features that make the model struggle in capturing the crucial concepts in...
Article
Deepfake techniques, represented by face swapping and face reenactment, can transfer the appearance and behavioral expressions of a face in one video image to another face in a different video. In recent years, with the advancement of deep learning techniques, deepfake technology has developed rapidly, achieving increasingly realistic effects. Therefore, many r...
Article
Facial recognition (FR) technology offers convenience in our daily lives, but it also raises serious privacy issues due to unauthorized FR applications. To protect facial privacy, existing methods have proposed adversarial face examples that can fool FR systems. However, most of these methods work only in the digital domain and do not consider natu...
Article
Blood vessel and surgical instrument segmentation is a fundamental technique for robot-assisted surgical navigation. Despite the significant progress in natural image segmentation, surgical image-based vessel and instrument segmentation are rarely studied. In this work, we propose a novel self-supervised pretraining method (SurgNet) that can effect...
Preprint
Learning generalizable representation and classifier for class-imbalanced data is challenging for data-driven deep models. Most studies attempt to re-balance the data distribution, which is prone to overfitting on tail classes and underfitting on head classes. In this work, we propose Dual Compensation Residual Networks to better fit both tail and...
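As context, a minimal sketch of the inverse-frequency re-weighting baseline that this entry argues tends to overfit tail classes and underfit head classes; the class counts are illustrative.

```python
import torch
import torch.nn as nn

# Inverse-frequency class weights: the common re-balancing baseline.
class_counts = torch.tensor([5000., 500., 50.])   # head -> tail (illustrative)
weights = class_counts.sum() / (len(class_counts) * class_counts)
criterion = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(8, 3)
labels = torch.randint(0, 3, (8,))
loss = criterion(logits, labels)
```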
Article
Full-text available
Scene graph aims to faithfully reveal humans’ perception of image content. When humans look at a scene, they usually focus on the parts they are interested in with a particular priority. This innate habit indicates a hierarchical preference in human perception. Therefore, we argue for generating the Scene Graph of Interest, which should be hierarchically construct...
Conference Paper
Full-text available
Existing studies indicate that deep neural networks (DNNs) can eventually memorize the label noise. We observe that the memorization strength of DNNs towards each instance is different and can be represented by the confidence value, which becomes larger and larger during the training process. Based on this, we propose a Dynamic Instance-specific Se...
Article
Learning generalizable representation and classifier for class-imbalanced data is challenging for data-driven deep models. Most studies attempt to re-balance the data distribution, which is prone to overfitting on tail classes and underfitting on head classes. In this work, we propose Dual Compensation Residual Networks to better fit both tail and...
Preprint
Feature distillation makes the student mimic the intermediate features of the teacher. Nearly all existing feature-distillation methods use L2 distance or its slight variants as the distance metric between teacher and student features. However, while L2 distance is isotropic w.r.t. all dimensions, the neural network's operation on different dimensi...
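For reference, the plain L2 (MSE) feature-distillation loss that this entry identifies as the near-universal choice; the projection layer and dimensions below are illustrative.

```python
import torch
import torch.nn.functional as F

def l2_feature_distillation(f_student: torch.Tensor,
                            f_teacher: torch.Tensor,
                            proj: torch.nn.Module) -> torch.Tensor:
    """Standard L2 feature distillation: student features are projected to the
    teacher's dimensionality and matched with MSE (teacher detached)."""
    return F.mse_loss(proj(f_student), f_teacher.detach())

# Usage with flattened features of shape (batch, dim):
proj = torch.nn.Linear(256, 512)
loss = l2_feature_distillation(torch.randn(8, 256), torch.randn(8, 512), proj)
```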
Chapter
The Visual Object Tracking challenge VOT2022 is the tenth annual tracker benchmarking activity organized by the VOT initiative. Results of 93 entries are presented; many are state-of-the-art trackers published at major computer vision conferences or in journals in recent years. The VOT2022 challenge was composed of seven sub-challenges focusing on...
Article
Cross-modality face image synthesis such as sketch-to-photo, NIR-to-RGB, and RGB-to-depth has wide applications in face recognition, face animation, and digital entertainment. Conventional cross-modality synthesis methods usually require paired training data, i.e., each subject has images of both modalities. However, paired data can be difficult to...
Chapter
Connectionist Temporal Classification (CTC) is a popular objective function in sequence recognition, which provides supervision for unsegmented sequence data through aligning sequence and its corresponding labeling iteratively. The blank class of CTC plays a crucial role in the alignment process and is often considered responsible for the peaky beh...
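As background, a minimal example of standard CTC supervision in PyTorch, with index 0 serving as the blank class discussed in this entry; all shapes and sizes are illustrative.

```python
import torch
import torch.nn as nn

# Time steps, batch, classes (including blank), target length (illustrative).
T, N, C, S = 50, 4, 20, 12
log_probs = torch.randn(T, N, C).log_softmax(dim=-1)
targets = torch.randint(1, C, (N, S))            # labels exclude the blank index 0
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
```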
Chapter
The current popular two-stream, two-stage tracking framework extracts the template and the search region features separately and then performs relation modeling, thus the extracted features lack the awareness of the target and have limited target-background discriminability. To tackle the above issue, we propose a novel one-stream tracking (OSTrack...
Article
Alternately reasoning over visual facts and commonsense is fundamental for an advanced VQA system. This ability requires models to go beyond a literal understanding of commonsense. The system should not just treat objects as entry points for querying background knowledge, but fully ground commonsense in the visual world and imagine the possible r...
Article
Class Incremental Learning (CIL), an indispensable ability for open-world applications such as service robots, has received increasing attention in recent years. Although many CIL methods have emerged, researchers usually adopt default class orders, leaving the characteristics of different class orders largely unexplored. In this paper, we rethink class...
Preprint
The key to address clothes-changing person re-identification (re-id) is to extract clothes-irrelevant features, e.g., face, hairstyle, body shape, and gait. Most current works mainly focus on modeling body shape from multi-modality information (e.g., silhouettes and sketches), but do not make full use of the clothes-irrelevant information in the or...
Article
Full-text available
Face recognition has been significantly advanced by deep learning based methods. In all face recognition methods based on convolutional neural networks (CNNs), the convolutional kernels for feature extraction are fixed regardless of the input face once the training stage is finished. By contrast, we humans are usually impressed by some unique charact...
Article
State-of-the-art person search methods solve person search in two steps, detection and re-ID, but neglect the need for consistency between these two stages. The re-ID stage needs more accurate detection results on query targets and fewer boxes of other people; the detection stage needs the re-ID stage to be robust against unavailable detecti...
Article
Face recognition remains a challenging task in unconstrained scenarios, especially when faces are partially occluded. To improve the robustness against occlusion, augmenting the training images with artificial occlusions has proved to be a useful approach. However, these artificial occlusions are commonly generated by adding a black rectangle or...
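For context, a minimal sketch of the black-rectangle occlusion augmentation this entry refers to, using torchvision's RandomErasing; the probability and scale values are illustrative.

```python
from torchvision import transforms

# Paste a random black rectangle onto each training face crop (tensor image).
augment = transforms.Compose([
    transforms.ToTensor(),
    transforms.RandomErasing(p=0.5, scale=(0.05, 0.2), value=0),  # value=0 -> black box
])
```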
Preprint
Teaching machines to recognize a new category based on a few training samples, especially only one, remains challenging owing to an incomplete understanding of the novel category caused by the lack of data. However, humans can learn new classes quickly even given few samples, since humans can tell what discriminative features should be focused on ab...
Conference Paper
Full-text available
In this paper, we propose a new method for remote photoplethysmography (rPPG) based heart rate (HR) estimation. In particular, our proposed method BVPNet is streamlined to predict the blood volume pulse (BVP) signals from face videos. Towards this, we firstly define ROIs based on facial landmarks and then extract the raw temporal signal from each R...
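As background, a crude rPPG baseline that recovers heart rate from the mean green-channel trace of an ROI via an FFT peak; this is a classic reference pipeline, not the BVPNet model described here, and the band limits are illustrative.

```python
import numpy as np

def estimate_hr(roi_frames: np.ndarray, fps: float) -> float:
    """Estimate heart rate (bpm) from an ROI video clip.

    `roi_frames` is (T, H, W, 3) RGB. Green-channel mean per frame gives a raw
    pulse trace; the dominant frequency in a plausible HR band gives the rate.
    """
    trace = roi_frames[..., 1].mean(axis=(1, 2))   # green channel per frame
    trace = trace - trace.mean()
    freqs = np.fft.rfftfreq(len(trace), d=1.0 / fps)
    power = np.abs(np.fft.rfft(trace)) ** 2
    band = (freqs >= 0.7) & (freqs <= 4.0)         # ~42-240 bpm
    return float(freqs[band][np.argmax(power[band])] * 60.0)
```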
Article
Capturing long-range dependencies during feature extraction is crucial for video-based person re-identification (re-id) since it would help to tackle many challenging problems such as occlusion and dramatic pose variation. Moreover, capturing subtle differences, such as bags and glasses, is indispensable to distinguish similar pedestrians. In this...
Conference Paper
Full-text available
In order to train 3D gaze estimators without too many annotations, we propose an unsupervised learning framework, Cross-Encoder, to leverage the unlabeled data to learn suitable representation for gaze estimation. To address the issue that the feature of gaze is always intertwined with the appearance of the eye, Cross-Encoder disentangles the featu...
Conference Paper
Full-text available
Vision-based Continuous Sign Language Recognition (CSLR) aims to recognize unsegmented signs from image streams. Overfitting is one of the most critical problems in CSLR training, and previous works show that the iterative training scheme can partially solve this problem while also costing more training time. In this study, we revisit the iterative...
Article
Location is important distinguishing information for instance segmentation. In this paper, we propose a novel model, called Location Sensitive Network (LSNet), for human instance segmentation. LSNet integrates instance-specific location information into a one-stage segmentation framework. Specifically, in the segmentation branch, Pose Attention Mo...
Preprint
Full-text available
We introduce a new efficient framework, the Unified Context Network (UniCon), for robust active speaker detection (ASD). Traditional methods for ASD usually operate on each candidate's pre-cropped face track separately and do not sufficiently consider the relationships among the candidates. This potentially limits performance, especially in challen...
Preprint
Face recognition remains a challenging task in unconstrained scenarios, especially when faces are partially occluded. To improve the robustness against occlusion, augmenting the training images with artificial occlusions has proved to be a useful approach. However, these artificial occlusions are commonly generated by adding a black rectangle or...
Preprint
Full-text available
This paper presents a method for gaze estimation from face images. We train several gaze estimators adopting four different network architectures, including an architecture designed for gaze estimation (i.e., iTracker-MHSA) and three originally designed for general computer vision tasks (i.e., BoTNet, HRNet, ResNeSt). Then, we select the best...
Preprint
Full-text available
Person re-identification (reID) plays an important role in computer vision. However, existing methods suffer from performance degradation in occluded scenes. In this work, we propose an occlusion-robust block, Region Feature Completion (RFC), for occluded reID. Different from most previous works that discard the occluded regions, RFC block can reco...
Article
Background: There is a large group of deaf-mute people in the world, and sign language is their major communication tool. Therefore, it is necessary for deaf-mute people to communicate with hearing-speech people, and hearing-speech people also need to understand sign language, which produces a great demand for sign language teaching. Even though th...
Article
Apathy is characterized by symptoms such as reduced emotional response, lack of motivation, and limited social interaction. Current methods for apathy diagnosis require the patient’s presence in a clinic and time-consuming clinical interviews, which are costly and inconvenient for both patients and clinical staff, hindering, among others, large-scale...
Article
Person re-identification (reID) plays an important role in computer vision. However, existing methods suffer from performance degradation in occluded scenes. In this work, we propose an occlusion-robust block, Region Feature Completion (RFC), for occluded reID. Different from most previous works that discard the occluded regions, RFC block can re...
Preprint
Full-text available
This paper proposes a novel model, named Continuity-Discrimination Convolutional Neural Network (CD-CNN), for visual object tracking. Existing state-of-the-art tracking methods do not deal with temporal relationships in video sequences, which leads to imperfect feature representations. To address this problem, CD-CNN models temporal appearance conti...
Preprint
Full-text available
Vision-based Continuous Sign Language Recognition (CSLR) aims to recognize unsegmented gestures from image sequences. To better train CSLR models, the iterative training scheme is widely adopted to alleviate the overfitting of the alignment model. Although the iterative training scheme can improve performance, it will also increase the training tim...
Article
For a given text, previous text-to-image synthesis methods commonly utilize a multistage generation model to produce images with high resolution in a coarse-to-fine manner. However, these methods ignore the interaction among stages, and they do not constrain the consistent cross-sample relations of images generated in different stages. These defici...
Article
Due to the environmental differences, many face anti-spoofing methods fail to generalize to unseen scenarios. In light of this, we propose a unified unsupervised and semi-supervised domain adaptation network (USDAN) for cross-scenario face anti-spoofing, aiming at minimizing the distribution discrepancy between the source and the target domains. Sp...
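For context, a minimal linear MMD term of the kind used to measure source-target distribution discrepancy in domain adaptation; it is a generic stand-in, not the USDAN losses proposed here.

```python
import torch

def linear_mmd(source_feat: torch.Tensor, target_feat: torch.Tensor) -> torch.Tensor:
    """Linear MMD: squared distance between the mean source and target features."""
    return (source_feat.mean(dim=0) - target_feat.mean(dim=0)).pow(2).sum()

# Usage with batches of 128-d features from each domain (shapes illustrative):
loss_da = linear_mmd(torch.randn(32, 128), torch.randn(32, 128))
```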
Chapter
Scene graph aims to faithfully reveal humans’ perception of image content. When humans analyze a scene, they usually prefer to describe the image gist first, namely the major objects and key relations in a scene. This inherent perceptual habit implies that there exists a hierarchical structure in humans’ preference during the scene parsing...
Preprint
Full-text available
Lip reading, also known as visual speech recognition, aims to recognize the speech content from videos by analyzing the lip dynamics. There has been appealing progress in recent years, benefiting much from rapidly developed deep learning techniques and recent large-scale lip-reading datasets. Most existing methods obtained high per...
Chapter
Due to the imperfect person detection results and posture changes, temporal appearance misalignment is unavoidable in video-based person re-identification (ReID). In this case, 3D convolution may destroy the appearance representation of person video clips, thus it is harmful to ReID. To address this problem, we propose Appearance-Preserving 3D Conv...
Chapter
Although two-stage object detectors have continuously advanced the state-of-the-art performance in recent years, the training process itself is far from crystal. In this work, we first point out the inconsistency problem between the fixed network settings and the dynamic training procedure, which greatly affects the performance. For example, the fi...
Chapter
This paper proposes a Temporal Complementary Learning Network that extracts complementary features of consecutive video frames for video person re-identification. Firstly, we introduce a Temporal Saliency Erasing (TSE) module including a saliency erasing operation and a series of ordered learners. Specifically, for a specific frame of a video, the...
Article
In this paper, we address one specific video retrieval problem in terms of human faces. Given a query in the form of either a frame or a sequence from a person, we search the database and return the most relevant face videos, i.e., ones that have the same class label as the query. Such a problem is very challenging due to the large intra-class variations...