
Zhedong ZhengNational University of Singapore | NUS
Zhedong Zheng
PhD
About
83
Publications
42,980
Reads
How we measure 'reads'
A 'read' is counted each time someone views a publication summary (such as the title, abstract, and list of authors), clicks on a figure, or views or downloads the full-text. Learn more
8,745
Citations
Introduction
Hi! I am currently a postdoctoral research fellow at NExT++, School of Computing, National University of Singapore with Prof. Tat-Seng Chua. I received Ph.D. from the ReLER Lab, University of Technology Sydney (UTS) , under the supervision of Prof. Yi Yang and Dr. Liang Zheng. I received my Bachelor’s degree from Fudan University in 2016, under the supervision of Prof. Xiangyang Xue.
Publications
Publications (83)
This paper considers the task of matching images and sentences. The challenge consists in discriminatively embedding the two modalities onto a shared visual-textual space. Existing work in this field largely uses Recurrent Neural Networks (RNN) for text feature learning and employs off-the-shelf Convolutional Neural Networks (CNN) for image feature...
The main contribution of this paper is a simple semisupervised
pipeline that only uses the original training set
without collecting extra data. It is challenging in 1) how
to obtain more training data only from the training set and
2) how to use the newly generated data. In this work, the
generative adversarial network (GAN) is used to generate
unl...
This paper focuses on the unsupervised domain adaptation of transferring the knowledge from the source domain to the target domain in the context of semantic segmentation. Existing approaches usually regard the pseudo label as the ground truth to fully exploit the unlabeled target-domain data. Yet the pseudo labels of the target-domain data are usu...
Person re-identification (re-id) remains challenging due to significant intra-class variations across different cam- eras. Recently, there has been a growing interest in using generative models to augment training data and enhance the invariance to input changes. The generative pipelines in existing methods, however, stay relatively separate from t...
Information retrieval (IR) is a fundamental technique that aims to acquire information from a collection of documents, web pages, or other sources. While traditional text-based IR has achieved great success, the under-utilization of varied data sources in different modalities (i.e., text, images, audio, and video) would hinder IR techniques from gi...
Unmanned Aerial Vehicles (UAVs), also known as drones, have
become increasingly popular in recent years due to their ability
to capture high-quality multimedia data from the sky. With the
rise of multimedia applications, such as aerial photography, cine-
matography, and mapping, UAVs have emerged as a powerful tool
for gathering rich and diverse mu...
Unmanned Aerial Vehicles (UAVs), also known as drones, have become increasingly popular in recent years due to their ability to capture high-quality multimedia data from the sky. With the rise of multimedia applications, such as aerial photography, cine-matography, and mapping, UAVs have emerged as a powerful tool for gathering rich and diverse mul...
In this paper, we introduce a large Multi-Attribute and Language Search dataset for text-based person retrieval, called MALS, and explore the feasibility of performing pre-training on both attribute recognition and image-text matching tasks in one stone. In particular, MALS contains 1,510,330 image-text pairs, which is about 37.5 times larger than...
Language-guided image retrieval enables users to search for images and interact with the retrieval system more naturally and expressively by using a reference image and a relative caption as a query. Most existing studies mainly focus on designing image-text composition architecture to extract discriminative visual-linguistic relations. Despite gre...
This paper studies compositional 3D-aware image synthesis for both single-object and multi-object scenes. We observe that two challenges remain in this field: existing approaches (1) lack geometry constraints and thus compromise the multi-view consistency of the single object, and (2) can not scale to multi-object scenes with complex backgrounds. T...
Existing task-oriented conversational search systems heavily rely on domain ontologies with pre-defined slots and candidate value sets. In practical applications, these prerequisites are hard to meet, due to the emerging new user requirements and ever-changing scenarios. To mitigate these issues for better interaction performance, there are efforts...
Image-text retrieval aims to bridge the modality gap and retrieve cross-modal content based on semantic similarities. Prior work usually focuses on the pairwise relations (i.e., whether a data sample matches another) but ignores the higher-order neighbor relations (i.e., a matching structure among multiple data samples). Re-ranking, a popular post-...
Recent research on video moment retrieval has mostly
focused on enhancing the performance of accuracy, efficiency, and robustness, all of which largely rely on the abundance of high-quality annotations. While the precise frame-level annotations are time-consuming and cost-expensive,
few attentions have been paid to the labeling process. In
this wor...
Most image retrieval works aim at learning discriminative visual features, while little attention is paid to the retrieval efficiency. The speed of feature extraction is key to the real-world system. Therefore, in this paper, we focus on network pruning for image retrieval acceleration. Different from the classification models predicting discrete c...
Sign language recognition (SLR) aims to overcome the communication barrier for the people with deafness or the people with hard hearing. Most existing approaches can be typically divided into two lines, i.e., Skeleton-based and RGB-based methods, but both the two lines of methods have their limitations. RGB-based approaches usually overlook the fin...
This paper aims to craft adversarial queries for image retrieval, which uses image features for similarity measurement. Many commonly used methods are developed in the context of image classification. However, these methods, which attack prediction probabilities, only exert an indirect influence on the image features and are thus found less effecti...
Traffic accident prediction in driving videos aims to provide an early warning of the accident occurrence, and supports the decision making of safe driving systems. Previous works usually concentrate on the spatial-temporal correlation of object-level context, while they do not fit the inherent long-tailed data distribution well and are vulnerable...
Unsupervised Domain Adaptation (UDA) aims to enhance the generalization of the learned model to other domains. The domain-invariant knowledge is transferred from the model trained on labeled source domain, e.g., video game, to unlabeled target domains, e.g., real-world scenarios, saving annotation expenses. Existing UDA methods for semantic segment...
We investigate composed image retrieval with text feedback. Users gradually look for the target of interest by moving from coarse to fine-grained feedback. However, existing methods merely focus on the latter, i.e, fine-grained search, by harnessing positive and negative pairs during training. This pair-based paradigm only considers the one-to-one...
Person search aims at localizing and recognizing query persons from raw video frames, which is a combination of two sub-tasks,
i.e., pedestrian detection and person re-identification. The dominant fashion is termed as the one-step person search that jointly optimizes detection and identification in a unified network, exhibiting higher efficiency....
Cross-view geo-localization aims to spot images of the same location shot from two platforms, e.g., the drone platform and the satellite platform. Existing methods usually focus on optimizing the distance between one embedding with others in the feature space, while neglecting the redundancy of the embedding itself. In this paper, we argue that the...
In the driving scene, the road participants usually show frequent interaction and intention understanding with the surrounding. Ego-agent (each road participant itself) conducts the prediction of what behavior will be done by other road users all the time and expects a shared and consistent understanding. For instance, we need to predict the next m...
People live in a 3D world. However, existing works on person re-identification (re-id) mostly consider the semantic representation learning in a 2D space, intrinsically limiting the understanding of people. In this work, we address this limitation by exploring the prior knowledge of the 3D body structure. Specifically, we project 2D images to a 3D...
Domain adaptation is to transfer the shared knowledge learned from the source domain to a new environment,
i.e.
, target domain. One common practice is to train the model on both labeled source-domain data and unlabeled target-domain data. Yet the learned models are usually biased due to the strong supervision of the source domain. Most researche...
Voice conversion is to generate a new speech with the source content and a target voice style. In this paper, we focus on one general setting, i.e., non-parallel many-to-many voice conversion, which is close to the real-world scenario. As the name implies, non-parallel many-to-many voice conversion does not require the paired source and reference s...
Sign language is the window for people differently-abled to express their feelings as well as emotions. However, it remains challenging for people to learn sign language in a short time. To address this real-world challenge, in this work, we study the motion transfer system, which can transfer the user photo to the sign language video of specific w...
In this paper, we study the cross-view geo-localization problem to match images from different viewpoints. The key motivation underpinning this task is to learn a discriminative viewpoint-invariant visual representation. Inspired by the human visual system for mining local patterns, we propose a new framework called RK-Net to jointly learn the disc...
This research aims to study a self-supervised 3D clothing reconstruction method, which recovers the geometry shape, and texture of human clothing from a single 2D image. Compared with existing methods, we observe that three primary challenges remain: (1) the conventional template-based methods are limited to modeling non-rigid clothing objects, e.g...
Aerial-view geo-localization tends to determine an unknown position through matching the drone-view image with the geo-tagged satellite-view image. This task is mostly regarded as an image retrieval problem. The key underpinning this task is to design a series of deep neural networks to learn discriminative image descriptors. However, existing meth...
3D-aware image synthesis aims to generate images of objects from multiple views by learning a 3D representation. However, one key challenge remains: existing approaches lack geometry constraints, hence usually fail to generate multi-view consistent images. To address this challenge, we propose Multi-View Consistent Generative Adversarial Networks (...
Image-based virtual try-on is challenging in fitting a target in-shop clothes onto a reference person under diverse human poses. Previous works focus on preserving clothing details (
e.g.,
texture, logos, patterns) when transferring desired clothes onto a target person under a fixed pose. However, the performances of existing methods significantly...
Text-video retrieval is one of the basic tasks for multimodal research and has been widely harnessed in many real-world systems. Most existing approaches directly compare the global representation between videos and text descriptions and utilize the global contrastive loss to train the model. These designs overlook the local alignment and the word-...
The manual annotation for large-scale point clouds costs a lot of time and is usually unavailable in harsh real-world scenarios. Inspired by the great success of the pre-training and fine-tuning paradigm in both vision and language tasks, we argue that pre-training is one potential solution for obtaining a scalable model to 3D point cloud downstrea...
Deep learning has shown significant successes in person reidentification (re-id) tasks. However, most existing works focus on discriminative feature learning and impose complex neural networks, suffering from low inference efficiency. In fact, feature extraction time is also crucial for real-world applications and lightweight models are needed. Pre...
The annotation for large-scale point clouds is still time-consuming and unavailable for many real-world tasks. Point cloud pre-training is one potential solution for obtaining a scalable model for fast adaptation. Therefore, in this paper, we investigate a new self-supervised learning approach, called Mixing and Disentangling (MD), for point cloud...
Image-based virtual try-on is challenging in fitting a target in-shop clothes into a reference person under diverse human poses. Previous works focus on preserving clothing details ( e.g., texture, logos, patterns ) when transferring desired clothes onto a target person under a fixed pose. However, the performances of existing methods significantly...
Obtaining viewer responses from videos can be useful for creators and streaming platforms to analyze the video performance and improve the future user experience. In this report, we present our method for 2021 Evoked Expression from Videos Challenge. In particular, our model utilizes both audio and image modalities as inputs to predict emotion chan...
Vehicle search is one basic task for the efficient traffic management in terms of the AI City. Most existing practices focus on the image-based vehicle matching, including vehicle re-identification and vehicle tracking. In this paper, we apply one new modality, i.e., the language description, to search the vehicle of interest and explore the potent...
The goal of person search is to localize and match query persons from scene images. For high efficiency, one-step methods have been developed to jointly handle the pedestrian detection and identification sub-tasks using a single network. There are two major challenges in the current one-step approaches. One is the mutual interference between the op...
Domain adaptation is to transfer the shared knowledge learned from the source domain to a new environment, i.e., target domain. One common practice is to train the model on both labeled source-domain data and unlabeled target-domain data. Yet the learned models are usually biased due to the strong supervision of the source domain. Most researchers...
The goal of person search is to localize and match query persons from scene images. For high efficiency, one-step methods have been developed to jointly handle the pedestrian detection and identification sub-tasks using a single network. There are two major challenges in the current one-step approaches. One is the mutual interference between the op...
Cross-view geo-localization is to spot images of the same geographic target from different platforms, e.g., drone-view cameras and satellites. It is challenging in the large visual appearance changes caused by extreme viewpoint variations. Existing methods usually concentrate on mining the fine-grained feature of the geographic target in the image...
The re-ranking approach leverages high-confidence retrieved samples to refine retrieval results, which have been widely adopted as a post-processing tool for image retrieval tasks. However, we notice one main flaw of re-ranking, i.e., high computational complexity, which leads to an unaffordable time cost for real-world applications. In this paper,...
Cross-view geo-localization is to spot images of the same geographic target from different platforms, e.g., drone-view cameras and satellites. It is challenging in the large visual appearance changes caused by extreme viewpoint variations. Existing methods usually concentrate on mining the fine-grained feature of the geographic target in the image...
One fundamental challenge of vehicle re-identification (re-id) is to learn robust and discriminative visual representation, given the significant intra-class vehicle variations across different camera views. As the existing vehicle datasets are limited in terms of training images and viewpoints, we propose to build a unique large-scale vehicle data...
This work focuses on the unsupervised scene adaptation problem of learning from both labeled source data and unlabeled target data. Existing approaches focus on minoring the inter-domain gap between the source and target domains. However, the intra-domain knowledge and inherent uncertainty learned by the network are under-explored. In this paper, w...
This paper focuses on the real-world automatic makeup problem. Given one non-makeup target image and one reference image, the automatic makeup is to generate one face image, which maintains the original identity with the makeup style in the reference image. In the real-world scenario, face makeup task demands a robust system against the environment...
Eyeglasses removal is challenging in removing different kinds of eyeglasses, e.g., rimless glasses, full-rim glasses, and sunglasses, and recovering appropriate eyes. Due to the significant visual variants, the conventional methods lack scalability. Most existing works focus on the frontal face images in the controlled environment, such as the labo...
People live in a 3D world. However, existing works on person re-identification (re-id) mostly consider the representation learning in a 2D space, intrinsically limiting the understanding of people. In this work, we address this limitation by exploring the prior knowledge of the 3D body structure. Specifically, we project 2D images to a 3D space and...
One fundamental challenge of vehicle re-identification (re-id) is to learn robust and discriminative visual representation, given the significant intra-class vehicle variations across different camera views. As the existing vehicle datasets are limited in terms of training images and viewpoints, we propose to build a unique large-scale vehicle data...
This paper focuses on the unsupervised domain adaptation of transferring the knowledge from the source domain to the target domain in the context of semantic segmentation. Existing approaches usually regard the pseudo label as the ground truth to fully exploit the unlabeled target-domain data. Yet the pseudo labels of the target-domain data are usu...
We consider the problem of cross-view geo-localization. The primary challenge of this task is to learn the robust feature against large viewpoint changes. Existing benchmarks can help, but are limited in the number of viewpoints. Image pairs, containing two viewpoints, e.g., satellite and ground, are usually provided, which may compromise the featu...
This paper focuses on network pruning for image retrieval acceleration. Prevailing image retrieval works target at the discriminative feature learning, while little attention is paid to how to accelerate the model inference, which should be taken into consideration in real-world practice. The challenge of pruning image retrieval models is that the...
We consider the unsupervised scene adaptation problem of learning from both labeled source data and unlabeled target data. Existing methods focus on minoring the inter-domain gap between the source and target domains. However, the intra-domain knowledge and inherent uncertainty learned by the network are under-explored. In this paper, we propose an...
Eyeglasses removal is challenging in removing different kinds of eyeglasses, e.g., rimless glasses, full-rim glasses and sunglasses, and recovering appropriate eyes. Due to the large visual variants, the conventional methods lack scalability. Most existing works focus on the frontal face images in the controlled environment such as laboratory and n...
Vehicle re-identification (re-id) remains challenging due to significant intra-class variations across different cameras. In this paper, we present our solution to AICity Vehicle Re-id Challenge 2019. The limited training data motivates us to leverage the free data from the web and deploy the two-stage learning strategy. The success of large-scale...
Person re-identification (re-id) remains challenging due to significant intra-class variations across different cameras. Recently, there has been a growing interest in using generative models to augment training data and enhance the invariance to input changes. The generative pipelines in existing methods, however, stay relatively separate from the...
Sufficient training data normally is required to train deeply learned models. However, due to the expensive manual process for a labeling large number of images (
i.e.
, annotation), the amount of available training data (
i.e.
, real data) is always limited. To produce more data for training a deep network, generative adversarial network can be...
In human parsing, the pixel-wise classification loss has drawbacks in its low-level local inconsistency and high-level semantic inconsistency. The introduction of the adversarial network tackles the two problems using a single discriminator. However, the two types of parsing inconsistency are generated by distinct mechanisms, so it is difficult for...
Adversarial examples in recent works target at closed set recognition systems, in which the training and testing classes are identical. In real-world scenarios, however, the testing classes may have limited, if any, overlap with the training classes, a problem named open set recognition. To our knowledge, the community does not have a specific desi...
In human parsing, the pixel-wise classification loss has drawbacks in its low-level local inconsistency and high-level semantic inconsistency. The introduction of the adversarial network tackles the two problems using a single discriminator. However, the two types of parsing inconsistency are generated by distinct mechanisms, so it is difficult for...
Person re-identification (re-ID) is challenging because pedestrians may exhibit distinct appearance under different cameras. Given a query image, previous methods usually output the person retrieval results directly, which may perform badly due to the limited information provided by the single query image. To mine more query information, we add an...