About
301 Publications · 20,806 Reads · 6,203 Citations
Publications (301)
Robots can acquire complex manipulation skills by learning policies from expert demonstrations, which is often known as vision-based imitation learning. Generating policies based on diffusion and flow matching models has been shown to be effective, particularly in robotic manipulation tasks. However, recursion-based approaches are often inference...
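As a rough illustration of the sampling loop a flow-matching policy implies, the sketch below Euler-integrates a learned velocity field from Gaussian noise to an action; `velocity_net`, the observation embedding, and the step count are hypothetical placeholders, not the authors' implementation.

```python
import numpy as np

def sample_action(velocity_net, obs_embedding, action_dim, num_steps=10, rng=None):
    """Euler-integrate a learned velocity field from noise (t=0) to an action (t=1).

    `velocity_net(x, t, obs)` is a hypothetical callable returning dx/dt.
    """
    rng = rng or np.random.default_rng()
    x = rng.standard_normal(action_dim)                      # start from Gaussian noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        x = x + dt * velocity_net(x, k * dt, obs_embedding)  # one Euler step
    return x

# Toy check with a stand-in "network" that pulls x toward a fixed target action.
target = np.array([0.2, -0.1, 0.4])
toy_velocity = lambda x, t, obs: (target - x) / max(1.0 - t, 1e-3)
print(sample_action(toy_velocity, obs_embedding=None, action_dim=3))
```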
Homography estimation is a fundamental task in computer vision with applications in diverse fields. Recent advances in deep learning have improved homography estimation, particularly with unsupervised learning approaches, offering increased robustness and generalizability. However, accurately predicting homography, especially in complex motions, re...
In this paper, we propose a temporal group alignment and fusion network to enhance the quality of compressed videos by using the long-short term correlations between frames. The proposed model consists of the intra-group feature alignment (IntraGFA) module, the inter-group feature fusion (InterGFF) module, and the feature enhancement (FE) module. W...
Compressed video super-resolution (VSR) is employed to generate high-resolution (HR) videos from low-resolution (LR) compressed videos. Recently, some compressed VSR methods have adopted coding priors, such as partition maps, compressed residual frames, predictive pictures and motion vectors, to generate HR videos. However, these methods disregard...
With the rapid development of machine vision technology in recent years, many researchers have begun to focus on feature compression that is better suited for machine vision tasks. The target of feature compression is deep features, which arise from convolution in the middle layer of a pre-trained convolutional neural network. However, due to the l...
Accurate segmentation of lesions is crucial for diagnosis and treatment of early esophageal cancer (EEC). However, neither traditional nor deep learning-based methods to date can meet the clinical requirements, with the mean Dice score - the most important metric in medical image analysis - hardly exceeding 0.75. In this paper, we present a nov...
Supervised homography estimation methods face a challenge due to the lack of adequate labeled training data. To address this issue, we propose DMHomo, a diffusion model-based framework for supervised homography learning. This framework generates image pairs with accurate labels, realistic image content, and realistic interval motion, ensuring they...
Existing homography and optical flow methods are erroneous in challenging scenes, such as fog, rain, night, and snow, because basic assumptions such as brightness and gradient constancy are broken. To address this issue, we present an unsupervised learning approach that fuses gyroscope data into homography and optical flow learning. Specifically, we...
Video super-resolution (VSR) is used to compose high-resolution (HR) video from low-resolution video. Recently, deformable alignment-based VSR methods have become increasingly popular. In these methods, the features extracted from the video are aligned to eliminate motion errors and achieve high super-resolution (SR) quality. However, these metho...
We present a novel deep camera path optimization framework for minimum latency online video stabilization. Typically, a stabilization pipeline consists of three steps: motion estimation, path smoothing, and novel view synthesis. Most previous methods concentrate on motion estimation while path optimization receives less attention, particularly in t...
Unsupervised methods have received increasing attention in homography learning due to their promising performance and label-free training. However, existing methods do not explicitly consider the plane-induced parallax, making the prediction compromised on multiple planes. In this work, we propose a novel method HomoGAN to guide unsupervised homogr...
Omnidirectional video streaming is usually implemented based on the representations of tiles, where the tiles are obtained by splitting the video frame into several rectangular areas and each tile is converted into multiple representations with different resolutions and encoded at different bitrates. One key issue in omnidirectional video streaming...
In this paper, we propose a new method to inpaint videos with removed regions. Our method was developed by combining short-term propagation-based inpainting (STPI) and long-term propagation-based inpainting (LTPI) modules. The STPI module is designed to in-fill an image from a single frame with local reference information, whilst the L...
In this paper, we propose an iterative framework, which consists of two phases: a generation phase and a training phase, to generate realistic training data and yield a supervised homography network. In the generation phase, given an unlabeled image pair, we utilize the pre-estimated dominant plane masks and homography of the pair, along with anoth...
This paper proposes a hybrid synthesis method for fusing multi-exposure images taken by hand-held cameras. Motions either due to the shaky camera or caused by dynamic scenes should be compensated before any content fusion. Any misalignment can easily cause blurring/ghosting artifacts in the fused result. Our hybrid method can deal with such motions...
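For context, a single-scale, weight-map style of exposure fusion (in the spirit of classic multi-exposure fusion, not the paper's hybrid method) might look like the sketch below; it assumes the inputs are already motion-compensated, which is precisely the part the paper addresses.

```python
import numpy as np

def naive_exposure_fusion(images, sigma=0.2):
    """Blend pre-aligned exposures with per-pixel well-exposedness weights.

    `images`: list of float arrays in [0, 1], all of shape (H, W, 3).
    Single-scale illustration; real pipelines add pyramid blending and, as in
    the paper, motion compensation before any fusion.
    """
    stack = np.stack(images, axis=0)                        # (N, H, W, 3)
    # Pixels near mid-gray get high weight; under/over-exposed pixels get low weight.
    weights = np.exp(-((stack - 0.5) ** 2) / (2 * sigma ** 2)).prod(axis=-1)
    weights /= weights.sum(axis=0, keepdims=True) + 1e-8    # normalize over exposures
    return (weights[..., None] * stack).sum(axis=0)

# Usage: fused = naive_exposure_fusion([img_under, img_mid, img_over])
```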
In this letter, we propose two solutions for the rate control of VVC low-delay coding. Both solutions are developed by determining the bit allocation factors for video frames based on their dependency. Specifically, we design the first solution according to the distortion correlation between the key frame and its subsequent frames. With this so...
Blind image super-resolution (BISR) aims to construct a high-resolution image from a low-resolution (LR) image that contains unknown degradation. Although previous methods have demonstrated impressive performance by introducing the degradation representation into the BISR task, two problems still exist in most of them. First, they ignore the degradation...
We propose a Generative Adversarial Network (GAN)-based architecture for achieving high-quality physically based rendering (PBR). Conventional PBR relies heavily on ray tracing, which is computationally expensive in complicated environments. Some recent deep learning-based methods can improve efficiency but cannot deal with illumination variation w...
Video inpainting aims to fill in missing regions of a video after any undesired contents are removed from it. This technique can be applied to repair broken videos or edit video content. In this paper, we propose a depth-guided deep video inpainting network (DGDVI) and demonstrate its effectiveness in processing challenging broken areas cros...
In this paper, we introduce a new framework for unsupervised deep homography estimation. Our contributions are threefold. First, unlike previous methods that regress 4 offsets for a homography, we propose a homography flow representation, which can be estimated by a weighted sum of 8 pre-defined homography flow bases. Second, considering a homography...
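One plausible way to realize such a homography flow representation is to derive the 8 flow bases from the Jacobian of the homography warp at the identity and combine them with predicted weights; the construction below is an illustrative first-order version, and the paper's actual bases may be normalized or orthogonalized differently.

```python
import numpy as np

def homography_flow_bases(height, width):
    """Eight flow-field bases from the Jacobian of the homography warp at identity.

    One plausible construction of pre-defined homography flow bases
    (a first-order expansion over normalized coordinates); the bases used in
    the paper may differ in normalization or orthogonalization.
    """
    ys, xs = np.meshgrid(np.linspace(-1, 1, height),
                         np.linspace(-1, 1, width), indexing="ij")
    zeros, ones = np.zeros_like(xs), np.ones_like(xs)
    bases = [
        (xs, zeros), (ys, zeros), (ones, zeros),      # affine/translation terms in x
        (zeros, xs), (zeros, ys), (zeros, ones),      # affine/translation terms in y
        (-xs * xs, -xs * ys), (-xs * ys, -ys * ys),   # perspective terms
    ]
    return np.stack([np.stack(b, axis=-1) for b in bases], axis=0)   # (8, H, W, 2)

def homography_flow(weights, bases):
    """Flow field as a weighted sum of the 8 bases; `weights` has shape (8,)."""
    return np.tensordot(weights, bases, axes=1)        # (H, W, 2)

# Usage: flow = homography_flow(predicted_weights, homography_flow_bases(240, 320))
```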
High dynamic range (HDR) deghosting algorithms aim to generate ghost-free HDR images with realistic details. Restricted by the locality of the receptive field, existing CNN-based methods are typically prone to producing ghosting artifacts and intensity distortions in the presence of large motion and severe saturation. In this paper, we propose a no...
The paper proposes a method to effectively fuse multi-exposure inputs and generate high-quality high dynamic range (HDR) images with unpaired datasets. Deep learning-based HDR image generation methods rely heavily on paired datasets. The ground truth images play a leading role in generating reasonable HDR images. Datasets without ground truth are h...
Convolutional neural network (CNN)-based methods offer effective solutions for enhancing the quality of compressed images and videos. However, these methods do not exploit the raw data to enhance the quality. In this paper, we adopt the raw data for quality enhancement of the HEVC intra-coded image by proposing an online learning-based method. When q...
We propose a decoder-friendly chrominance enhancement method for compressed images. Our proposed method is developed based on the luminance-guided chrominance enhancement network (LGCEN) and online learning. With LGCEN, the textures of the compressed chrominance components are enhanced under the guidance of the luminance component. Moreover, LGCEN is...
Data association is important in point cloud registration. In this work, we propose to solve the partial-to-partial registration from a new perspective, by introducing multi-level feature interactions between the source and the reference clouds at the feature extraction stage, such that the registration can be realized without the attentions or...
In this paper, we propose a luminance-guided chrominance image enhancement convolutional neural network for HEVC intra coding. Specifically, we first develop a gated recursive asymmetric-convolution block to restore each degraded chrominance image, which generates an intermediate output. Then, guided by the luminance image, the quality of this in...
In this paper, we propose a new fast CU partition algorithm for VVC intra coding based on cross-block difference. This difference is measured by the gradient and the content of sub-blocks obtained from partitioning and is employed to guide the skipping of unnecessary horizontal and vertical partition modes. With this guidance, a fast determination of...
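A toy version of such a cross-block difference measure might compare the gradient statistics of the candidate sub-blocks, as sketched below; the exact features and thresholds of the proposed algorithm are not given in the snippet, so everything here is illustrative.

```python
import numpy as np

def subblock_gradient_difference(block):
    """Toy cross-block difference for a luma block (2-D array).

    Compares the mean gradient magnitude of the top/bottom halves and of the
    left/right halves; a small difference suggests the corresponding split
    direction adds little and could be skipped. The real algorithm's features
    and thresholds are not given in the snippet above.
    """
    gy, gx = np.gradient(block.astype(np.float64))
    grad = np.hypot(gx, gy)
    h, w = grad.shape
    horizontal_diff = abs(grad[: h // 2].mean() - grad[h // 2:].mean())
    vertical_diff = abs(grad[:, : w // 2].mean() - grad[:, w // 2:].mean())
    return horizontal_diff, vertical_diff

# Usage: skip the horizontal split when horizontal_diff falls below a chosen threshold.
```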
3D object detection has become an emerging task in autonomous driving scenarios. Most previous works process 3D point clouds using either projection-based or voxel-based models. However, both approaches contain some drawbacks. The voxel-based methods lack semantic information, while the projection-based methods suffer from numerous spatial infor...
The paper proposes a solution based on a Generative Adversarial Network (GAN) for solving jigsaw puzzles. The problem assumes that an image is divided into equal square pieces, and asks to recover the image according to information provided by the pieces. Conventional jigsaw puzzle solvers often determine the relationships based on the boundaries of...
We present an unsupervised optical flow estimation method by proposing an adaptive pyramid sampling in the deep pyramid network. Specifically, in the pyramid downsampling, we propose a Content-Aware Pooling (CAP) module, which promotes local feature gathering by avoiding cross-region pooling, so that the learned features become more representative....
Panoramic images have a much larger field of view than standard perspective images and thus naturally encode enriched scene context information, which, however, is not well exploited in previous scene understanding methods. In this paper, we propose a novel method for panoramic 3D scene understanding which recovers the 3D room layout and the sh...
Mobile captured images can be aligned using their gyroscope sensors. The optical image stabilizer (OIS) eliminates this possibility by adjusting the images during capture. In this work, we propose a deep network that compensates for the motions caused by the OIS, such that the gyroscopes can be used for image alignment on OIS cameras. To achi...
Occlusion is an inevitable and critical problem in unsupervised optical flow learning. Existing methods either treat occluded regions the same as non-occluded regions or simply remove them to avoid errors. However, the occluded regions can provide effective information for optical flow learning. In this paper, we present OIFlow, an occlusion-inpai...
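A common way to obtain the occlusion regions in unsupervised flow learning is a forward-backward consistency check; the sketch below shows that check only (not the paper's occlusion-inpainting step), with the usual thresholding constants as placeholders.

```python
import numpy as np

def occlusion_mask_fb(flow_fw, flow_bw, alpha1=0.01, alpha2=0.5):
    """Forward-backward consistency check, a common occlusion estimate.

    flow_fw, flow_bw: (H, W, 2) arrays holding per-pixel (dx, dy).
    Returns a boolean mask that is True where a pixel is likely occluded.
    Nearest-neighbor warping is used for brevity; bilinear is typical in practice.
    """
    h, w = flow_fw.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    # Where each pixel of frame 1 lands in frame 2 (clipped to the image bounds).
    xt = np.clip(np.round(xs + flow_fw[..., 0]).astype(int), 0, w - 1)
    yt = np.clip(np.round(ys + flow_fw[..., 1]).astype(int), 0, h - 1)
    flow_bw_warped = flow_bw[yt, xt]            # backward flow pulled to frame 1
    diff = flow_fw + flow_bw_warped             # cancels out where flows are consistent
    bound = alpha1 * ((flow_fw ** 2).sum(-1) + (flow_bw_warped ** 2).sum(-1)) + alpha2
    return (diff ** 2).sum(-1) > bound

# Usage: occluded = occlusion_mask_fb(flow_12, flow_21); valid = ~occluded
```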
The channel redundancy of convolutional neural networks (CNNs) results in large consumption of memory and computational resources. In this work, we design a novel Slim Convolution (SlimConv) module to boost the performance of CNNs by reducing channel redundancies. Our SlimConv consists of three main steps: Reconstruct, Transform, and F...
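A rough, hypothetical PyTorch sketch of a channel-slimming block is shown below; since the names of the three steps are truncated above, the final fusion stage here is an assumption and should not be read as the exact SlimConv module.

```python
import torch
import torch.nn as nn

class ChannelSlimBlock(nn.Module):
    """Hypothetical channel-slimming block (not the exact SlimConv module).

    Reweights channels with a squeeze-and-excitation style gate, splits the
    reweighted features into two halves, builds two paths (one of them using a
    channel-flipped copy of the second half), and fuses them at half the input
    width to cut channel redundancy.
    """

    def __init__(self, channels):
        super().__init__()
        assert channels % 4 == 0, "channels must be divisible by 4"
        half = channels // 2
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, channels, 1), nn.Sigmoid(),
        )
        self.transform_a = nn.Conv2d(half, half, 3, padding=1)
        self.transform_b = nn.Conv2d(half, half // 2, 3, padding=1)
        self.fuse = nn.Conv2d(half + half // 2, half, 1)

    def forward(self, x):
        w = self.gate(x)                                    # per-channel weights
        top, bottom = torch.chunk(x * w, 2, dim=1)          # two reweighted halves
        path_a = self.transform_a(top + bottom)             # plain merged path
        path_b = self.transform_b(top + torch.flip(bottom, dims=[1]))  # flipped path
        return self.fuse(torch.cat([path_a, path_b], dim=1))  # half-width output

# Usage: y = ChannelSlimBlock(64)(torch.randn(1, 64, 32, 32))  # y has 32 channels
```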
In this paper, we introduce a novel point-to-surface representation for 3D point cloud learning. Unlike previous methods that mainly adopt voxel, mesh, or point coordinates, we propose to tackle this problem from a new perspective: learn a set of static and global reference surfaces based on quadratic terms to describe 3D shapes, such that the coo...
Data association is important in point cloud registration. In this work, we propose to solve the partial-to-partial registration from a new perspective, by introducing feature interactions between the source and the reference clouds at the feature extraction stage, such that the registration can be realized without the explicit mask estimation...
We present a new pipeline for holistic 3D scene understanding from a single image, which predicts object shape, object pose, and scene layout. As this is a highly ill-posed problem, existing methods usually suffer from inaccurate estimation of both shapes and layout, especially for cluttered scenes, due to the heavy occlusion between objects. W...
Point cloud registration is a key task in many computational fields. Previous correspondence matching based methods require the point clouds to have distinctive geometric structures to fit a 3D rigid transformation according to point-wise sparse feature matches. However, the accuracy of transformation heavily relies on the quality of extracted feat...
The paper proposes a method to effectively fuse multi-exposure inputs and generate high-quality high dynamic range (HDR) images with unpaired datasets. Deep learning-based HDR image generation methods rely heavily on paired datasets. The ground truth provides the information the network needs to produce HDR images without ghosting. Datasets without ground...
The paper proposes a solution based on a Generative Adversarial Network (GAN) for solving jigsaw puzzles. The problem assumes that an image is cut into equal square pieces, and asks to recover the image from the piece information. Conventional jigsaw solvers often determine piece relationships based on the piece boundaries, which ignores the impo...
The paper proposes a solution to effectively handle salient regions for style transfer between unpaired datasets. Recently, Generative Adversarial Networks (GANs) have demonstrated their potential for translating images from a source domain $X$ to a target domain $Y$ in the absence of paired examples. However, such a translation cannot guarantee...
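The unpaired X-to-Y setting typically relies on a cycle-consistency constraint; a minimal sketch of that standard loss follows (the generator names are placeholders, and the paper's salient-region handling would add further terms on top of it).

```python
import torch
import torch.nn.functional as F

def cycle_consistency_loss(G_xy, G_yx, x, y, lam=10.0):
    """Standard cycle-consistency term for unpaired X<->Y translation.

    G_xy and G_yx are the two generators (placeholder names); a salient-region
    aware method would add further terms on top of this baseline constraint.
    """
    x_rec = G_yx(G_xy(x))   # X -> Y -> X
    y_rec = G_xy(G_yx(y))   # Y -> X -> Y
    return lam * (F.l1_loss(x_rec, x) + F.l1_loss(y_rec, y))
```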
In this letter, we compose a new three-stage deep convolutional neural network (NTSDCN) for image demosaicking, and it consists of our proposed Laplacian energy-constrained local residual unit (LC-LRU) and a feature-guided prior fusion unit (FG-PFU). Specifically, the LC-LRU is used to refine the learning target of the specific residual blocks in t...
Single image deraining regards an input image as a fusion of a background image, a transmission map, rain streaks, and atmosphere light. While advanced models have been proposed for image restoration (i.e., background image generation), they regard rain streaks as having the same properties as the background rather than as a transmission medium. As vapors (i.e., rain...
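One common way to combine the four components named above is the composition below; the exact formulation used in the paper is truncated here, so treat this as an assumed model for illustration only.

```python
import numpy as np

def compose_rainy_image(background, rain_streaks, transmission, atmosphere=1.0):
    """Assumed composition: O = T * (B + S) + (1 - T) * A.

    `background` and `rain_streaks` are (H, W, 3) float arrays in [0, 1],
    `transmission` is (H, W) in [0, 1], and `atmosphere` is a scalar or (3,)
    vector for the atmospheric light. This is an illustrative model only.
    """
    t = transmission[..., None]
    return t * (background + rain_streaks) + (1.0 - t) * np.asarray(atmosphere)
```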
Facial Expression Recognition (FER) is a challenging yet important research topic owing to its academic and commercial potential. In this work, we propose an oriented attention pseudo-siamese network that takes advantage of global and local facial information for highly accurate FER. Our network consists of two branc...
Quality enhancement of HEVC compressed videos has attracted considerable attention in recent years. In this paper, we propose a robust multi-frame guided attention network (MGANet) to reconstruct high-quality frames from HEVC compressed videos. In our network, we first use an advanced motion flow algorithm to estimate the motion information of inp...
Removing rain effects from an image is of importance for various applications such as autonomous driving, drone piloting, and photo editing. Conventional methods rely on heuristics to handcraft various priors to remove or separate the rain effects from an image. Recent deep learning models have been proposed to learn end-to-end methods to complete t...
3D object detection has become an emerging task in autonomous driving scenarios. Previous works process 3D point clouds using either projection-based or voxel-based models. However, both approaches contain some drawbacks. The voxel-based methods lack semantic information, while the projection-based methods suffer from numerous spatial information l...
Accurate 3D object detection from point clouds has become a crucial component in autonomous driving. However, the volumetric representations and the projection methods in previous works fail to establish the relationships between the local point sets. In this paper, we propose Sparse Voxel-Graph Attention Network (SVGA-Net), a novel end-to-end trai...
Physically based rendering has been widely used to generate photo-realistic images, which greatly impacts industry by providing appealing rendering, such as for entertainment and augmented reality, and academia by providing large-scale, high-fidelity synthetic training data for data-hungry methods like deep learning. However, physically based renderin...
Multiple description coding (MDC) is an efficient source coding technique for error-prone transmission over multiple channels. In this paper, we focus on the design of a new polyphase down-sampling based MDC (NPDS-MDC) for image signals. The encoding of our proposed NPDS-MDC consists of three steps. First, we perform down-sampling on each $N \ti...
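In its simplest 2x2 form, polyphase down-sampling splits an image into four phase-shifted sub-images that could each seed a description, as sketched below; the full NPDS-MDC encoding adds further steps that are truncated above.

```python
import numpy as np

def polyphase_split(image, n=2):
    """Split a 2-D image into n*n polyphase sub-images.

    Each sub-image keeps every n-th pixel at a different phase offset; in the
    simplest MDC setup each sub-image (or a group of them) seeds one description.
    Only the splitting is shown here, not the full NPDS-MDC encoder.
    """
    return [image[i::n, j::n] for i in range(n) for j in range(n)]

# Usage: d0, d1, d2, d3 = polyphase_split(np.arange(64).reshape(8, 8))
```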
Removing rain streaks from a single image continues to draw attention in outdoor vision systems. In this paper, we present an efficient method to remove rain streaks. First, the location map of rain pixels needs to be known as precisely as possible; to this end, we implement a relatively accurate detection of rain streaks by utilizing two charac...
The channel redundancy in feature maps of convolutional neural networks (CNNs) results in large consumption of memory and computational resources. In this work, we design a novel Slim Convolution (SlimConv) module to boost the performance of CNNs by reducing channel redundancies. Our SlimConv consists of three main steps: Reconstruct, Transfo...
Light-field raw data captured by a state-of-the-art light-field camera is limited in its spatial and angular resolutions due to the camera’s optical hardware. In this paper, we propose an all-software algorithm to synthesize light-field raw data from a single RGB-D input image, which is driven largely by the need in the research area of light-field...
The unprecedented performance achieved by deep convolutional neural networks for image classification is linked primarily to their ability of capturing rich structural features at various layers within networks. Here we design a series of experiments, inspired by children's learning of the arithmetic addition of two integers, to showcase that such...
We present a new deep point cloud rendering pipeline through multi-plane projections. The input to the network is the raw point cloud of a scene, and the output is an image or image sequence from a novel view or along a novel camera trajectory. Unlike previous approaches that directly project features from 3D points onto the 2D image domain, we propose t...
In recent years, deep learning-based methods have made significant progress in rain removal. However, the existing methods usually do not have good generalization ability, which means that almost all existing methods achieve satisfactory performance at removing a specific type of rain streak, but may have a relatively poor performance...
In this paper, we focus on the intrinsic image decomposition problem for stereoscopic image pairs. The existing methods cannot be applied directly to decompose stereoscopic images, as they often produce inconsistent reflectance (albedo) and 3D artifacts after the decomposition. We propose a straightforward yet effective framework that enables a high...
Rain removal in images/videos is still an important task in the computer vision field and is attracting the attention of more and more people. Traditional methods typically utilize incomplete priors or filters (e.g., the guided filter) to remove the rain effect. Deep learning offers more possibilities to better solve this task. However, they remove rain either by ev...