Chao Ma's research while affiliated with Beijing Forestry University and other places

Publications (90)

Chapter
Real-time and high-performance 3D object detection is of critical importance for autonomous driving. Recent top-performing 3D object detectors mainly rely on point-based or 3D voxel-based convolutions, which are both computationally inefficient for onboard deployment. In contrast, pillar-based methods use solely 2D convolutions, which consume less...
Chapter
Transformer trackers have achieved impressive advancements recently, where the attention mechanism plays an important role. However, the independent correlation computation in the attention mechanism could result in noisy and ambiguous attention weights, which inhibits further performance improvement. To address this issue, we propose an attention...
Preprint
Deep learning models have shown their vulnerability when dealing with adversarial attacks. Existing attacks almost perform on low-level instances, such as pixels and super-pixels, and rarely exploit semantic clues. For face recognition attacks, existing methods typically generate the l_p-norm perturbations on pixels, however, resulting in low attac...
Preprint
Full-text available
Due to the difficulty in collecting paired real-world training data, image deraining is currently dominated by supervised learning with synthesized data generated by e.g., Photoshop rendering. However, the generalization to real rainy scenes is usually limited due to the gap between synthetic and real-world data. In this paper, we first statistical...
Preprint
Full-text available
High-speed, high-resolution stereoscopic (H2-Stereo) video allows us to perceive dynamic 3D content at fine granularity. The acquisition of H2-Stereo video, however, remains challenging with commodity cameras. Existing spatial super-resolution or temporal frame interpolation methods provide compromised solutions that lack temporal or spatial detail...
Article
Exploring and fabricating smart actuating materials that strike the perfect balance between humidity response and mechanical integrity (especially wet tensile strength) via reasonable structural design and simple yet low-cost preparation is critical for biomimetic devices, soft robotics, artificial muscles and generators, but it remains challenging...
Preprint
Transformer trackers have achieved impressive advancements recently, where the attention mechanism plays an important role. However, the independent correlation computation in the attention mechanism could result in noisy and ambiguous attention weights, which inhibits further performance improvement. To address this issue, we propose an attention...
Preprint
Full-text available
Recent RGB-D semantic segmentation has motivated research interest thanks to the accessibility of complementary modalities from the input side. Existing works often adopt a two-stream architecture that processes photometric and geometric information in parallel, with few methods explicitly leveraging the contribution of depth cues to adjust the sam...
Article
The explosive growth of image data facilitates the fast development of image processing and computer vision methods for emerging visual applications, meanwhile introducing novel distortions to the processed images. This poses a grand challenge to existing blind image quality assessment (BIQA) models, which are weak at adapting to subpopulation shif...
Preprint
Full-text available
Real-time and high-performance 3D object detection is of critical importance for autonomous driving. Recent top-performing 3D object detectors mainly rely on point-based or 3D voxel-based convolutions, which are both computationally inefficient for onboard deployment. In contrast, pillar-based methods use merely 2D convolutions, which consume less...
Preprint
Various facial manipulation techniques have drawn serious public concerns in morality, security, and privacy. Although existing face forgery classifiers achieve promising performance on detecting fake images, these methods are vulnerable to adversarial examples with injected imperceptible perturbations on the pixels. Meanwhile, many face forgery de...
Preprint
Learning a dense 3D model with fine-scale details from a single facial image is highly challenging and ill-posed. To address this problem, many approaches fit smooth geometries through facial prior while learning details as additional displacement maps or personalized basis. However, these techniques typically require vast datasets of paired multi-...
Article
High-speed, high-resolution stereoscopic (H $^{2}$ -Stereo) video allows us to perceive dynamic 3D content at fine granularity. The acquisition of H $^{2}$ -Stereo video, however, remains challenging with commodity cameras. Existing spatial super-resolution or temporal frame interpolation methods provide compromised solutions that lack temporal or...
Preprint
Recent RGBD-based models for saliency detection have attracted research attention. The depth clues such as boundary clues, surface normal, shape attribute, etc., contribute to the identification of salient objects with complicated scenarios. However, most RGBD networks require multi-modalities from the input side and feed them separately through a...
Chapter
Unsupervised visual tracking has received increasing attention recently. Existing unsupervised visual tracking methods mainly exploit the cycle consistency of sequential images to learn an unsupervised representation for target objects. Due to the small appearance changes between consecutive images, existing unsupervised deep trackers compute the c...
Preprint
In this paper, we propose to learn an Unsupervised Single Object Tracker (USOT) from scratch. We identify that three major challenges, i.e., moving object discovery, rich temporal variation exploitation, and online update, are the central causes of the performance bottleneck of existing unsupervised trackers. To narrow the gap between unsupervised...
Preprint
Full-text available
Estimating 3D human pose and shape from a single image is highly under-constrained. To address this ambiguity, we propose a novel prior, namely kinematic dictionary, which explicitly regularizes the solution space of relative 3D rotations of human joints in the kinematic tree. Integrated with a statistical human model and a deep neural network, our...
Preprint
Adversarial attack arises due to the vulnerability of deep neural networks to perceive input samples injected with imperceptible perturbations. Recently, adversarial attack has been applied to visual object tracking to evaluate the robustness of deep trackers. Assuming that the model structures of deep trackers are known, a variety of white-box att...
Preprint
Full-text available
The explosive growth of image data facilitates the fast development of image processing and computer vision methods for emerging visual applications, meanwhile introducing novel distortions to the processed images. This poses a grand challenge to existing blind image quality assessment (BIQA) models, failing to continually adapt to such subpopulati...
Article
In the above article [1] , the title is incorrect. The correct title is “End-to-End Learning Deep CRF Models for Multi-Object Tracking.”
Article
Full-text available
The advancement of visual tracking has continuously been brought by deep learning models. Typically, supervised learning is employed to train these models with expensive labeled data. In order to reduce the workload of manual annotation and learn to track arbitrary objects, we propose an unsupervised learning method for visual tracking. The motivat...
Article
Full-text available
Existing tracking-by-detection approaches using deep features have achieved promising results in recent years. However, these methods mainly exploit feature representations learned from individual static frames, thus paying little attention to the temporal smoothness between frames. This easily leads trackers to drift in the presence of large appea...
Article
In this paper, we address the issue of data imbalance in learning deep models for visual object tracking. Although it is well known that data distribution plays a crucial role in learning and inference models, considerably less attention has been paid to data imbalance in visual tracking. To balance training data, we propose a novel shrinkage loss...
Preprint
Full-text available
The emerging vision-and-language navigation (VLN) problem aims at learning to navigate an agent to the target location in unseen photo-realistic environments according to the given language instruction. The main challenges of VLN arise mainly from two aspects: first, the agent needs to attend to the meaningful paragraphs of the language instruction...
Article
Full-text available
The emerging vision-and-language navigation (VLN) problem aims at learning to navigate an agent to the target location in unseen photo-realistic environments according to the given language instruction. The main challenges of VLN arise mainly from two aspects: first, the agent needs to attend to the meaningful paragraphs of the language instruction...
Chapter
Single image deraining regards an input image as a fusion of a background image, a transmission map, rain streaks, and atmosphere light. While advanced models are proposed for image restoration (i.e., background image generation), they regard rain streaks with the same properties as background rather than transmission medium. As vapors (i.e., rain...
Chapter
While deep convolutional neural networks (CNNs) are vulnerable to adversarial attacks, considerably few efforts have been paid to construct robust deep tracking algorithms against adversarial attacks. Current studies on adversarial attack and defense mainly reside in a single image. In this work, we first attempt to generate adversarial examples on...
Chapter
Visual Question Answering (VQA) has achieved great success thanks to the fast development of deep neural networks (DNN). On the other hand, the data augmentation, as one of the major tricks for DNN, has been widely used in many computer vision tasks. However, there are few works studying the data augmentation problem for VQA and none of the existin...
Chapter
There are two main lines of research on visual question answering (VQA): compositional model with explicit multi-hop reasoning, and monolithic network with implicit reasoning in the latent feature space. The former excels in interpretability and compositionality but fails on real-world images, while the latter usually achieves better performance du...
Preprint
There are two main lines of research on visual question answering (VQA): compositional model with explicit multi-hop reasoning, and monolithic network with implicit reasoning in the latent feature space. The former excels in interpretability and compositionality but fails on real-world images, while the latter usually achieves better performance du...
Preprint
Full-text available
In this paper, we focus on exploring the fusion of images and point clouds for 3D object detection in view of the complementary nature of the two modalities, i.e., images possess more semantic information while point clouds specialize in distance sensing. To this end, we present a novel two-stage multi-modal fusion network for 3D object detection,...
Preprint
Single image deraining regards an input image as a fusion of a background image, a transmission map, rain streaks, and atmosphere light. While advanced models are proposed for image restoration (i.e., background image generation), they regard rain streaks with the same properties as background rather than transmission medium. As vapors (i.e., rain...
Preprint
The advancement of visual tracking has continuously been brought by deep learning models. Typically, supervised learning is employed to train these models with expensive labeled data. In order to reduce the workload of manual annotations and learn to track arbitrary objects, we propose an unsupervised learning method for visual tracking. The motiva...
Preprint
While deep convolutional neural networks (CNNs) are vulnerable to adversarial attacks, considerably few efforts have been paid to construct robust deep tracking algorithms against adversarial attacks. Current studies on adversarial attack and defense mainly reside in a single image. In this work, we first attempt to generate adversarial examples on...
Preprint
Visual Question Answering (VQA) has achieved great success thanks to the fast development of deep neural networks (DNN). On the other hand, the data augmentation, as one of the major tricks for DNN, has been widely used in many computer vision tasks. However, there are few works studying the data augmentation problem for VQA and none of the existin...
Article
Correlation filters (CF) have received considerable attention in visual tracking because of their computational efficiency. Leveraging deep features via off-the-shelf CNN models (e.g., VGG), CF trackers achieve state-of-the-art performance while consuming a large number of computing resources. This limits deep CF trackers to be deployed to many mob...
Preprint
Full-text available
We introduce a novel network, called CO-attention Siamese Network (COSNet), to address the unsupervised video object segmentation task from a holistic view. We emphasize the importance of inherent correlation among video frames and incorporate a global co-attention mechanism to improve further the state-of-the-art deep learning based solutions that...
Article
Facilitated by deep neural networks, numerous tracking methods have made significant advances. Existing deep trackers mainly utilize independent frames to model the target appearance, while paying less attention to its temporal coherence. In this paper, we propose a recurrent memory activation network (RMAN) to exploit the untapped temporal coheren...
Preprint
Different rain models and novel network structures have been proposed to remove rain streaks from single rainy images. In this work, we bring attention to the intrinsic priors and multi-scale features of the rainy images, and develop several intrinsic loss functions to train a CNN deraining network. We first study the sparse priors of rainy images,...
Article
Full-text available
In this paper, we propose an adaptive region proposal scheme with feature channel regularization to facilitate robust object tracking. We consider tracking as a linear regression problem and an ensemble of correlation filters is trained online to distinguish the foreground target from the background. Further, we integrate adaptively learned region...
Preprint
Correlation filters (CF) have received considerable attention in visual tracking because of their computational efficiency. Leveraging deep features via off-the-shelf CNN models (e.g., VGG), CF trackers achieve state-of-the-art performance while consuming a large number of computing resources. This limits deep CF trackers to be deployed to many mob...
Preprint
Rain removal in images/videos is still an important task in computer vision field and attracting attentions of more and more people. Traditional methods always utilize some incomplete priors or filters (e.g. guided filter) to remove rain effect. Deep learning gives more probabilities to better solve this task. However, they remove rain either by ev...
Conference Paper
Full-text available
We introduce a novel network, called CO-attention Siamese Network (COSNet), to address the unsupervised video object segmentation task from a holistic view. We emphasize the importance of inherent correlation among video frames and incorporate a global co-attention mechanism to improve further the state-of-the-art deep learning based solutions that...
Preprint
Existing deep trackers mainly use convolutional neural networks pre-trained for generic object recognition task for representations. Despite demonstrated successes for numerous vision tasks, the contributions of using pre-trained deep features for visual tracking are not as significant as that for object recognition. The key issue is that in visual...
Preprint
Full-text available
We propose an unsupervised visual tracking method in this paper. Different from existing approaches using extensive annotated data for supervised learning, our CNN model is trained on large-scale unlabeled videos in an unsupervised manner. Our motivation is that a robust tracker should be effective in both the forward and backward predictions (i.e....
Preprint
Full-text available
Video frame interpolation aims to synthesize nonexistent frames in-between the original frames. While significant advances have been made from the recent deep convolutional neural networks, the quality of interpolation is often reduced due to large object motion or occlusion. In this work, we propose a video frame interpolation method which explici...
Article
Existing trackers often decompose the task of visual tracking into multiple independent components, such as target appearance sampling, classifier learning, and target state inferring. In this paper, we present a transform-aware attentive tracking framework, which uses a deep attentive network to directly predict the target states via spatial trans...
Preprint
Visual attention, derived from cognitive neuroscience, facilitates human perception on the most pertinent subset of the sensory data. Recently, significant efforts have been made to exploit attention schemes to advance computer vision systems. For visual tracking, it is often challenging to track target objects undergoing large appearance changes....
Article
Full-text available
Object tracking is challenging as target objects often undergo drastic appearance changes over time. Recently, adaptive correlation filters have been successfully applied to object tracking. However, tracking algorithms relying on highly adaptive correlation filters are prone to drift due to noisy updates. Moreover, as these algorithms do not maint...
Article
Full-text available
The tracking-by-detection framework consists of two stages, i.e., drawing samples around the target object in the first stage and classifying each sample as the target object or as background in the second stage. The performance of existing trackers using deep classification networks is limited by two aspects. First, the positive samples in each fr...