Shengping Zhang

Harbin Institute of Technology | HIT · Department of Computer Science and Technology

PhD

About

135
Publications
33,092
Reads
6,287
Citations
Additional affiliations
February 2013 - January 2014
Brown University
Position
  • PostDoc Position
June 2012 - January 2013
Brown University
Position
  • Visiting Research Fellow
September 2011 - June 2012
University of California, Berkeley
Position
  • Visiting student researcher
Education
September 2008 - December 2012
Harbin Institute of Technology
Field of study
  • Computer Vision
September 2006 - July 2008
Harbin Institute of Technology
Field of study
  • Computer Vision

Publications (135)
Article
Successful video deblurring relies on effectively using sharp pixels from other frames to recover the blurry pixels of the current frame. However, mainstream methods only use estimated optical flows to align and fuse features from adjacent frames without considering the pixel-wise blur levels, leading to the introduction of blurry pixels from adjac...
Article
Recently, diffusion models have significantly improved the performance of Camouflaged Object Detection (COD) by adding noise to a mask and iteratively denoising it to match the target distributions. Due to the direct extraction of features from noisy masks and the lack of conditional constraints on a prediction area, the diffusion model may deviate...
Preprint
Differentiable rendering techniques have recently shown promising results for free-viewpoint video synthesis of characters. However, such methods, either Gaussian Splatting or neural implicit rendering, typically necessitate per-subject optimization which does not meet the requirement of real-time rendering in an interactive application. We propose...
Article
Full-text available
Semantic scene completion (SSC) aims to simultaneously perform scene completion (SC) and predict semantic categories of a 3D scene from a single depth and/or RGB image. Most existing SSC methods struggle to handle complex regions with multiple objects close to each other, especially for objects with reflective or dark surfaces. This primarily stems...
Article
Recently, RGB-T tracking methods have made significant progress, demonstrating remarkable capabilities in addressing the complexities of tracking tasks within demanding environments. However, these methods overlook instability of modal validity in real-world scenarios. This limits the model’s ability to understand the correlation between modalities...
Article
Full-text available
Face reshaping aims to adjust the shape of a face in a portrait image to make the face aesthetically beautiful, which has many potential applications. Existing methods 1) operate on the pre-defined facial landmarks, leading to artifacts and distortions due to the limited number of landmarks, 2) synthesize new faces based on segmentation masks or sk...
Conference Paper
Full-text available
We present a new approach, termed GPS-Gaussian, for synthesizing novel views of a character in a real-time manner. The proposed method enables 2K-resolution rendering under a sparse-view camera setting. Unlike the original Gaussian Splatting or neural implicit rendering methods that necessitate per-subject optimizations, we introduce Gaussian param...
Article
Face swapping aims to transfer the identity of a source face to a target face image while preserving the target attributes (e.g., facial expression, head pose, illumination, and background). Most existing methods use a face recognition model to extract global features from the source face and directly fuse them with the target to generate a swappin...
Article
Blind Face Super-Resolution (BFSR) has recently gained widespread attention, which aims to super-resolve Low-Resolution (LR) face images with complex unknown degradation to High-Resolution (HR) face images. However, existing BFSR methods suffer from two major limitations. First, most of them are trained on synthetic degradation data pairs with pre-...
Article
Online contextual reasoning and association across consecutive video frames are critical to perceive instances in visual tracking. However, most current top-performing trackers persistently lean on sparse temporal relationships between reference and search frames via an offline mode. Consequently, they can only interact independently within each im...
Article
In this paper, we propose SpectralNeRF, an end-to-end Neural Radiance Field (NeRF)-based architecture for high-quality physically based rendering from a novel spectral perspective. We modify the classical spectral rendering into two main steps, 1) the generation of a series of spectrum maps spanning different wavelengths, 2) the combination of thes...
Article
How to effectively exploit spatio-temporal information is crucial to capture target appearance changes in visual tracking. However, most deep learning-based trackers mainly focus on designing a complicated appearance model or template updating strategy, while lacking the exploitation of context between consecutive frames and thus entailing the when...
Article
Existing human matting methods are incapable of accurately estimating the alpha mattes of arbitrarily selected humans from a group photo. An alternative solution is to apply them to the corresponding cropped image patches, which however obtains inaccurate alpha estimation due to the interference of the body parts of the neighboring humans. In addit...
Article
Full-text available
Natural image matting aims to estimate the alpha matte of the foreground from a given image. Various approaches have been explored to address this problem, such as interactive matting methods that use guidance such as click or trimap, and automatic matting methods tailored to specific objects. However, existing matting methods are designed for spec...
Article
Full-text available
The knee has gradually become an important research target for the lower extremity exoskeleton. However, whether the flexion-assisted profile based on the contractile element (CE) is effective throughout the gait remains an open question. In this study, we first analyze the effective flexion-assisted method through the passive element...
Article
Both the discrete motion states and continuous joint kinematics are essential for controlling assistive robots under changeable environmental conditions. However, few studies investigate both the discrete and continuous motion intents. This paper is the first work to propose an end-to-end motion intent decoding method that integrates recognition of...
Article
Full-text available
Point cloud completion aims to estimate the missing shape from a partial point cloud. Existing encoder-decoder based generative models usually reconstruct the complete point cloud from the learned distribution of the shape prior, which may lead to distortion of geometric details (such as sharp structures and structures without smooth surfaces) due...
Preprint
Full-text available
For natural image matting, context information plays a crucial role in estimating alpha mattes, especially when it is challenging to distinguish the foreground from its background. Existing deep learning-based methods exploit specifically designed context aggregation modules to refine encoder features. However, the effectiveness of these modules has not...
Article
Transformers have achieved impressive progress in visual tracking due to their capability of global modeling, which enables them to learn low-frequency features (i.e., high-level semantic information). However, they seem to overlook the high-frequency features (i.e., low-level texture and edge information), which are crucial to identify different intra-c...
Article
Most of the existing bounding box-based trackers rely on a classification subnetwork and a regression subnetwork to predict the location and scale of the bounding box. They learn the classification subnetwork by processing each sample individually and applying the suggested classification confidence to produce the final prediction. They typically i...
Article
Deep learning-based image compressive sensing (CS) methods have achieved great success in the past few years. However, most of them are content-independent, with a spatially uniform sampling rate allocation for the entire image. Such practices may degrade the performance of image CS with block-based sampling, since the content of differ...
Article
Human instance matting aims to estimate an alpha matte for each human instance in an image, which is extremely challenging and has rarely been studied so far. Despite some efforts to use instance segmentation to generate a trimap for each instance and apply trimap-based matting methods, the resulting alpha mattes are often inaccurate due to inaccur...
Article
Existing makeup transfer methods typically transfer simple makeup colors in a well-conditioned face image and fail to handle makeup style details (e.g., complicated colors and shapes) and facial occlusion. To address these problems, this paper proposes Hybrid Transformers with Attention-guided Spatial Embeddings (named HT-ASE) for makeup transfer a...
Article
Variation of scales or aspect ratios has been one of the main challenges for tracking. To overcome this challenge, most existing methods adopt either multi-scale search or anchor-based schemes, which use a predefined search space in a handcrafted way and therefore limit their performance in complicated scenes. To address this problem, recent anchor...
Article
Backlit images are usually taken when the light source is opposite to the camera. The uneven exposure (e.g., underexposure on the foreground and overexposure on the background) makes the backlit images more challenging than general image enhancement tasks that only need to increase or decrease the exposure on the whole images. Compared to tradition...
Article
Full-text available
Background: Recently, the combination of deep learning and time-lapse imaging has provided an objective, standard and scientific solution for embryo selection. However, the reported studies were based on blastocyst formation or clinical pregnancy as the end point. To the best of our knowledge, there is no predictive model that uses the outcome of live b...
Article
Assisting human locomotion is essentially related to the assistive force profile, which can be determined from four aspects: timing, magnitude, shape and duration. Most current methods of decoding human motor intent enable the customized determination of the assistive force profile by providing information of different subsets of the four aspects....
Preprint
Full-text available
Natural image matting estimates the alpha values of unknown regions in the trimap. Recently, deep learning based methods propagate the alpha values from the known regions to unknown regions according to the similarity between them. However, we find that more than 50% of pixels in the unknown regions cannot be correlated to pixels in known regions due...
Article
Low-light image enhancement is a challenging task because enhancing image brightness and reducing image degradation should be considered simultaneously. Although existing deep learning-based methods improve the visibility of low-light images, many of them tend to lose details or sacrifice naturalness. To address these issues, we present a multi-st...
Article
Inferring the complete 3D shape of an object from an RGB image has shown impressive results; however, existing methods rely primarily on recognizing the most similar 3D model from the training set to solve the problem. These methods suffer from poor generalization and may lead to low-quality reconstructions for unseen objects. Nowadays, stereo came...
Article
The fast-growing techniques of measuring and fusing multi-modal biomedical signals enable advanced motor intent decoding schemes of lower-limb exoskeletons, meeting the increasing demand for rehabilitative or assistive applications of take-home healthcare. Challenges of exoskeletons’ motor intent decoding schemes remain in making a continuous predi...
Preprint
Recently, several Space-Time Memory based networks have shown that the object cues (e.g. video frames as well as the segmented object masks) from the past frames are useful for segmenting objects in the current frame. However, these methods exploit the information from the memory by global-to-global matching between the current and past frames, whi...
Preprint
Full-text available
The fast-growing techniques of measuring and fusing multi-modal biomedical signals enable advanced motor intent decoding schemes of lower-limb exoskeletons, meeting the increasing demand for rehabilitative or assistive applications of take-home healthcare. Challenges of exoskeletons' motor intent decoding schemes remain in making a continuous predict...
Chapter
Full-text available
In contrast to images taken on land scenes, images taken over water are more prone to degradation due to the influence of the haze. However, existing image dehazing methods are mainly developed for land scenes and perform poorly when applied to overwater images. To address this problem, we collect the first overwater image dehazing dataset and prop...
Article
Sketch recognition remains a significant challenge due to the limited training data and the substantial intra-class variance of freehand sketches for the same object. Conventional methods for this task often rely on the availability of the temporal order of sketch strokes, additional cues acquired from different modalities and supervised augmentati...
Article
Full-text available
Recovering the 3D shape of an object from single or multiple images with deep neural networks has been attracting increasing attention in the past few years. Mainstream works (e.g. 3D-R2N2) use recurrent neural networks (RNNs) to sequentially fuse feature maps of input images. However, RNN-based approaches are unable to produce consistent reconstru...
Chapter
Estimating the complete 3D point cloud from an incomplete one is a key problem in many vision and robotics applications. Mainstream methods (e.g., PCN and TopNet) use Multi-layer Perceptrons (MLPs) to directly process point clouds, which may cause the loss of details because the structure and context of point clouds are not fully considered. To so...
Preprint
Full-text available
Recovering the 3D shape of an object from single or multiple images with deep neural networks has been attracting increasing attention in the past few years. Mainstream works (e.g. 3D-R2N2) use recurrent neural networks (RNNs) to sequentially fuse feature maps of input images. However, RNN-based approaches are unable to produce consistent reconstru...
Preprint
Full-text available
Estimating the complete 3D point cloud from an incomplete one is a key problem in many vision and robotics applications. Mainstream methods (e.g., PCN and TopNet) use Multi-layer Perceptrons (MLPs) to directly process point clouds, which may cause the loss of details because the structure and context of point clouds are not fully considered. To so...
Preprint
Scene text recognition with arbitrary shape is very challenging due to large variations in text shapes, fonts, colors, backgrounds, etc. Most state-of-the-art algorithms rectify the input image into the normalized image, then treat the recognition as a sequence prediction task. The bottleneck of such methods is the rectification, which will cause e...
Chapter
Recently, Convolutional Neural Networks (CNNs) have achieved great success in object detection due to their outstanding abilities of learning powerful features on large-scale training datasets. One of the critical factors of their success is the accurate and complete annotation of the training dataset. However, accurately annotating the training da...
Preprint
Full-text available
Inferring the 3D shape of an object from an RGB image has shown impressive results; however, existing methods rely primarily on recognizing the most similar 3D model from the training set to solve the problem. These methods suffer from poor generalization and may lead to low-quality reconstructions for unseen objects. Nowadays, stereo cameras are p...
Article
Given massive video data generated from different applications such as security monitoring and traffic management, to save cost and human labour, developing an industrial intelligent video analytic system, which can automatically extract and analyze the meaningful content of videos, is essential. For achieving the objective of motion perception in...
Preprint
Sketch recognition remains a significant challenge due to the limited training data and the substantial intra-class variance of freehand sketches for the same object. Conventional methods for this task often rely on the availability of the temporal order of sketch strokes, additional cues acquired from different modalities and supervised augmentati...
Conference Paper
Full-text available
Recovering the 3D representation of an object from single-view or multi-view RGB images by deep neural networks has attracted increasing attention in the past few years. Several mainstream works (e.g., 3D-R2N2) use recurrent neural networks (RNNs) to fuse multiple feature maps extracted from input images sequentially. However, when given the same s...
Preprint
Recovering the 3D representation of an object from single-view or multi-view RGB images by deep neural networks has attracted increasing attention in the past few years. Several mainstream works (e.g., 3D-R2N2) use recurrent neural networks (RNNs) to fuse multiple feature maps extracted from input images sequentially. However, when given the same s...
Article
In recent years, convolutional neural networks (CNNs) have achieved great success in visual tracking. Most existing methods train or fine-tune a binary classifier to distinguish the target from its background. However, they may suffer from performance degradation due to insufficient training data. In this paper, we show that attribute inform...
Article
Full-text available
Hand-crafted and learning-based features are two main types of video representations in the field of video understanding. How to integrate their merits to design good descriptors has been the research hotspot recently. Motivated by TDD (Wang et al. 2015), we combine trajectory pooling method and 3D ConvNets (Tran et al. 2015) and put forward a nove...
Article
Full-text available
Plant identification is a critical step in protecting plant diversity. However, many existing identification systems prohibitively rely on hand-crafted features for plant species identification. In this paper, a deep learning method is employed to extract discriminative features from plant images along with a linear SVM for plant identification. To...
Conference Paper
Full-text available
Traditional image compressed sensing (CS) coding frameworks solve an inverse problem that is based on the measurement coding tools (prediction, quantization, entropy coding, etc.) and the optimization based image reconstruction method. These CS coding frameworks face the challenges of improving the coding efficiency at the encoder, while also suffe...
Article
To intelligently analyze and understand video content, a key step is to accurately perceive the motion of the objects of interest in videos. To this end, the task of object tracking, which aims to determine the position and status of the object of interest in consecutive video frames, is very important, and has received great research interest in the...
Article
Convolutional neural networks (CNNs) have been applied to visual tracking with demonstrated success in recent years. However, the performance of CNN-based trackers can be further improved, because the predicted upright bounding box cannot tightly enclose the target due to factors such as deformations and rotations. Besides, many existing CNN-based...
Chapter
Hand-crafted and learning-based features are two main types of video representations in the field of video understanding. How to combine their merits to design good descriptors has been the research hotspot recently. Following the idea of TDD [1], in this paper, we investigate if the trajectory pooling method is suitable to 3D ConvNets [2]. Specifi...
Article
Tracking targets of interest is an important step for motion perception in intelligent video surveillance systems. While most recently developed tracking algorithms are grounded in RGB image sequences, it should be noted that information from the RGB modality is not always reliable (e.g. in a dark environment with poor lighting conditions), which urges...
Article
Convolutional Neural Networks (CNNs) have been applied to visual tracking with demonstrated success in recent years. Most CNN-based trackers utilize hierarchical features extracted from a certain layer to represent the target. However, features from a certain layer are not always effective for distinguishing the target object from the backgrounds e...
Article
Sparse coding has been applied to visual tracking and related vision problems with demonstrated success in recent years. Existing tracking methods based on local sparse coding sample patches from a target candidate and sparsely encode these using a dictionary consisting of patches sampled from target template images. The discriminative strength of...
Article
The use of multiple features has been shown to be an effective strategy for visual tracking because of their complementary contributions to appearance modeling. The key problem is how to learn a fused representation from multiple features for appearance modeling. Different features extracted from the same object should share some commonalities in t...
Article
For autonomous driving applications, a car must be able to track objects in the scene in order to estimate where and how they will move, so that the tracker embedded in the car can efficiently alert the car for effective collision avoidance. Traditional discriminative object tracking methods usually train a binary classifier via a support vector m...
Chapter
This paper proposes a novel multi-layered gesture recognition method with Kinect. We explore the essential linguistic characters of gestures: the components concurrent character and the sequential organization character, in a multi-layered framework, which extracts features from both the segmented semantic units and the whole gesture sequence and t...
Article
In this paper, we study one-shot learning gesture recognition on RGB-D data recorded from Microsoft’s Kinect. To this end, we propose a novel bag of manifold words (BoMW) based feature representation on symmetric positive definite (SPD) manifolds. In particular, we use covariance matrices to extract local features from RGB-D data due to its compact...
Article
Full-text available
In this paper, we propose a novel plant identification method based on multipath sparse coding using SIFT features, which avoids the need for feature engineering and the reliance on botanical taxonomy. In particular, the proposed method uses five paths to model the shape and texture features of plant images, and at each path it learns the dictionari...
Conference Paper
3D mask spoofing attack has been one of the main challenges in face recognition. Among existing methods, texture-based approaches show powerful abilities and achieve encouraging results on 3D mask face anti-spoofing. However, these approaches may not be robust enough in application scenarios and could fail to detect imposters with hyper-real masks....
Conference Paper
Full-text available
The Visual Object Tracking challenge VOT2016 aims at comparing short-term single-object visual trackers that do not apply pre-learned models of object appearance. Results of 70 trackers are presented, with a large number of trackers being published at major computer vision conferences and journals in the recent years. The number of tested state-of-...
Technical Report
Full-text available
Recently, sparse representation-based visual tracking methods have attracted increasing attention in the computer vision community. Although they achieve superior performance to traditional tracking methods, a basic question has not yet been answered: is the sparsity constraint really needed for visual tracking? To answer this quest...