May 2024
·
5 Reads
Neurocomputing
October 2023
·
118 Reads
·
38 Citations
International Journal of Computer Vision
This paper investigates the capability of plain Vision Transformers (ViTs) for semantic segmentation using the encoder–decoder framework and introduces SegViTv2. In this study, we introduce a novel Attention-to-Mask (ATM) module to design a lightweight decoder that is effective for plain ViTs. The proposed ATM converts the global attention map into semantic masks for high-quality segmentation results. Our decoder outperforms the popular UPerNet decoder with various ViT backbones while consuming only about 5% of the computational cost. For the encoder, we address the concern of the relatively high computational cost of ViT-based encoders and propose a Shrunk++ structure that incorporates edge-aware query-based down-sampling (EQD) and query-based up-sampling (QU) modules. The Shrunk++ structure reduces the computational cost of the encoder by up to 50% while maintaining competitive performance. Furthermore, we propose to adapt SegViT for continual semantic segmentation, demonstrating nearly zero forgetting of previously learned knowledge. Experiments show that our proposed SegViTv2 surpasses recent segmentation methods on three popular benchmarks, including the ADE20k, COCO-Stuff-10k and PASCAL-Context datasets. The code is available through the following link: https://github.com/zbwxp/SegVit.
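For intuition, here is a minimal PyTorch sketch of what query-based up-sampling (QU) could look like, assuming a cross-attention formulation in which the tokens saved before down-sampling act as queries over the shrunk sequence; the class name, layer choices, and residual structure are our assumptions, not the official implementation:

```python
import torch.nn as nn

class QueryUpsample(nn.Module):
    """Illustrative query-based up-sampling (QU): high-resolution queries
    cross-attend to the shrunk token sequence to recover spatial detail."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, hires_queries, shrunk_tokens):
        # hires_queries: (B, N_hi, C), e.g. tokens saved before down-sampling
        # shrunk_tokens: (B, N_lo, C), tokens kept by the down-sampling stage
        out, _ = self.attn(hires_queries, shrunk_tokens, shrunk_tokens)
        return self.norm(hires_queries + out)
```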
June 2023
·
118 Reads
We explore the capability of plain Vision Transformers (ViTs) for semantic segmentation using the encoder-decoder framework and introduce SegViTv2. In our work, we implement the decoder with the global attention mechanism inherent in ViT backbones and propose the lightweight Attention-to-Mask module that effectively converts the global attention map into semantic masks for high-quality segmentation results. Our decoder can outperform the most commonly used UPerNet decoder with various ViT backbones while consuming only about 5% of the computational cost. For the encoder, we address the concern of the relatively high computational cost of ViT-based encoders and propose a Shrunk++ structure that incorporates edge-aware query-based down-sampling (EQD) and query-based up-sampling (QU) modules. The Shrunk++ structure reduces the computational cost of the encoder by up to 50% while maintaining competitive performance. Furthermore, due to the flexibility of our ViT-based architecture, SegViT can be easily extended to semantic segmentation under the continual learning setting, achieving nearly zero forgetting. Experiments show that our proposed SegViT outperforms recent segmentation methods on three popular benchmarks, including the ADE20k, COCO-Stuff-10k and PASCAL-Context datasets. The code is available through the following link: https://github.com/zbwxp/SegVit.
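The near-zero forgetting claim suggests a token-level extension mechanism. Below is a hedged PyTorch sketch of one such scheme, under our own assumption (not the paper's stated recipe) that class tokens from earlier tasks are frozen and only the tokens for newly introduced classes are trained:

```python
import torch
import torch.nn as nn

class ContinualClassTokens(nn.Module):
    """Old class tokens are frozen as a buffer; only tokens for newly added
    classes are trainable, so earlier classes cannot drift (zero forgetting)."""
    def __init__(self, old_tokens):
        super().__init__()
        # old_tokens: (K_old, C) class tokens learned on previous tasks
        self.register_buffer('old', old_tokens.detach().clone())
        self.new = nn.ParameterList()

    def add_task(self, num_new_classes):
        dim = self.old.size(-1)
        self.new.append(nn.Parameter(torch.randn(num_new_classes, dim) * 0.02))

    def forward(self):
        # (K_old + K_new, C) class tokens fed to the mask decoder
        return torch.cat([self.old, *self.new], dim=0)
```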
February 2023
·
564 Reads
The deployment of 3D detectors poses one of the major challenges in real-world self-driving scenarios. Existing BEV-based (i.e., bird's-eye view) detectors favor sparse convolutions (known as SPConv) to speed up training and inference, which creates a hard barrier to deployment, especially for on-device applications. In this paper, to tackle the challenge of efficient 3D object detection from an industry perspective, we devise a deployment-friendly pillar-based 3D detector, termed FastPillars. First, we introduce a novel lightweight Max-and-Attention Pillar Encoding (MAPE) module specifically designed to enhance small 3D objects. Second, we propose a simple yet effective principle for designing a backbone in pillar-based 3D detection. We construct FastPillars based on these designs, achieving high performance and low latency without SPConv. Extensive experiments on two large-scale datasets demonstrate the effectiveness and efficiency of FastPillars for on-device 3D detection regarding both performance and speed. Specifically, FastPillars delivers state-of-the-art accuracy on the Waymo Open Dataset with a 1.8X speedup and a 3.8 mAPH/L2 improvement over CenterPoint (SPConv-based). Our code is publicly available at: https://github.com/StiphyJay/FastPillars.
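As a rough illustration of the MAPE idea, the following PyTorch sketch combines a max-pooling branch and an attention-weighted pooling branch over the points of each pillar; the blending (a simple sum) and all layer sizes are assumptions for illustration rather than the paper's implementation:

```python
import torch.nn as nn

class MaxAndAttentionPillarEncoding(nn.Module):
    """Max-pooling and attention-weighted pooling over the points of a pillar,
    summed into one pillar feature. Assumes each pillar has >= 1 real point."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.mlp = nn.Linear(in_dim, out_dim)
        self.score = nn.Linear(out_dim, 1)

    def forward(self, pillar_points, mask):
        # pillar_points: (P, N, C_in) zero-padded points per pillar
        # mask: (P, N) bool, True for real points
        x = self.mlp(pillar_points)                               # (P, N, C)
        max_feat = x.masked_fill(~mask[..., None], float('-inf')).max(1).values
        logits = self.score(x).squeeze(-1).masked_fill(~mask, float('-inf'))
        weights = logits.softmax(dim=1).unsqueeze(-1)             # (P, N, 1)
        attn_feat = (weights * x).sum(dim=1)                      # (P, C)
        return max_feat + attn_feat                               # blended feature
```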
November 2022
·
48 Reads
·
87 Citations
Lecture Notes in Computer Science
We propose a direct, regression-based approach to 2D human pose estimation from single images. We formulate the problem as a sequence prediction task, which we solve using a Transformer network. This network directly learns a regression mapping from images to the keypoint coordinates, without resorting to intermediate representations such as heatmaps. This approach avoids much of the complexity associated with heatmap-based approaches. To overcome the feature misalignment issues of previous regression-based methods, we propose an attention mechanism that adaptively attends to the features that are most relevant to the target keypoints, considerably improving the accuracy. Importantly, our framework is end-to-end differentiable, and naturally learns to exploit the dependencies between keypoints. Experiments on MS-COCO and MPII, two predominant pose-estimation datasets, demonstrate that our method significantly improves upon the state-of-the-art in regression-based pose estimation. More notably, ours is the first regression-based approach to perform favorably compared to the best heatmap-based pose estimation methods. Code is available at: https://github.com/aim-uofa/Poseur. Keywords: 2D human pose estimation · Keypoint detection · Transformer
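A condensed PyTorch sketch of this regression formulation follows: one learnable query per keypoint is decoded against the image tokens, and an MLP regresses normalized (x, y) coordinates directly. The plain nn.TransformerDecoder stands in for the paper's adaptive attention mechanism, and all hyperparameters are illustrative:

```python
import torch
import torch.nn as nn

class KeypointRegressor(nn.Module):
    """One learnable query per keypoint cross-attends to image tokens; an
    MLP regresses normalized (x, y) coordinates directly -- no heatmaps."""
    def __init__(self, dim, num_keypoints=17, num_heads=8, depth=3):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_keypoints, dim))
        layer = nn.TransformerDecoderLayer(dim, num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=depth)
        self.head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                  nn.Linear(dim, 2))

    def forward(self, image_tokens):
        # image_tokens: (B, N, C) flattened backbone features
        q = self.queries.expand(image_tokens.size(0), -1, -1)
        out = self.decoder(q, image_tokens)          # (B, K, C)
        return self.head(out).sigmoid()              # (B, K, 2) in [0, 1]
```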
October 2022
·
190 Reads
·
2 Citations
We explore the capability of plain Vision Transformers (ViTs) for semantic segmentation and propose SegViT. Previous ViT-based segmentation networks usually learn a pixel-level representation from the output of the ViT. Differently, we make use of the fundamental component -- the attention mechanism -- to generate masks for semantic segmentation. Specifically, we propose the Attention-to-Mask (ATM) module, in which the similarity maps between a set of learnable class tokens and the spatial feature maps are transferred to the segmentation masks. Experiments show that our proposed SegViT using the ATM module outperforms its counterparts using the plain ViT backbone on the ADE20K dataset and achieves new state-of-the-art performance on the COCO-Stuff-10K and PASCAL-Context datasets. Furthermore, to reduce the computational cost of the ViT backbone, we propose query-based down-sampling (QD) and query-based up-sampling (QU) to build a Shrunk structure. With the proposed Shrunk structure, the model can save a substantial share of the computation while maintaining competitive performance.
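The core ATM idea can be sketched in a few lines of PyTorch: class tokens act as queries over the ViT's spatial tokens, and the query-key similarity map itself, passed through a sigmoid, is read out as the per-class mask. This is a simplification of the described mechanism, not the released code:

```python
import math
import torch
import torch.nn as nn

class AttentionToMask(nn.Module):
    """Learnable class tokens act as queries over the ViT's spatial tokens;
    the query-key similarity map, through a sigmoid, is the per-class mask."""
    def __init__(self, dim, num_classes):
        super().__init__()
        self.class_tokens = nn.Parameter(torch.randn(1, num_classes, dim))
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)

    def forward(self, feats, h, w):
        # feats: (B, H*W, C) spatial tokens from a plain ViT backbone
        b, n, c = feats.shape
        q = self.to_q(self.class_tokens.expand(b, -1, -1))   # (B, K, C)
        k = self.to_k(feats)                                  # (B, N, C)
        sim = q @ k.transpose(1, 2) / math.sqrt(c)            # (B, K, N)
        return sim.sigmoid().view(b, -1, h, w)                # (B, K, H, W) masks
```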
August 2022
·
35 Reads
Existing matching-based approaches perform video object segmentation (VOS) by retrieving support features from a pixel-level memory, while some pixels may suffer from a lack of correspondence in the memory (i.e., unseen), which inevitably limits their segmentation performance. In this paper, we present a Two-Stream Network (TSN). Our TSN includes (i) a pixel stream with a conventional pixel-level memory, to segment the seen pixels based on their pixel-level memory retrieval; (ii) an instance stream for the unseen pixels, where a holistic understanding of the instance is obtained with dynamic segmentation heads conditioned on the features of the target instance; and (iii) a pixel division module generating a routing map, with which the output embeddings of the two streams are fused together. The compact instance stream effectively improves the segmentation accuracy of the unseen pixels, while fusing the two streams with the adaptive routing map leads to an overall performance boost. Through extensive experiments, we demonstrate the effectiveness of our proposed TSN, and we also report state-of-the-art performance of 86.1% on YouTube-VOS 2018 and 87.5% on the DAVIS-2017 validation split.
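The routing-based fusion of the two streams admits a compact sketch: a 1x1 convolution over the concatenated stream embeddings predicts a soft routing map that blends the two outputs. The layer choices here are our assumptions for illustration, not the paper's exact design:

```python
import torch
import torch.nn as nn

class TwoStreamFusion(nn.Module):
    """A 1x1 conv over the concatenated stream embeddings predicts a soft
    routing map that blends pixel-stream and instance-stream outputs."""
    def __init__(self, dim):
        super().__init__()
        self.router = nn.Conv2d(2 * dim, 1, kernel_size=1)

    def forward(self, pixel_emb, instance_emb):
        # pixel_emb, instance_emb: (B, C, H, W) per-stream output embeddings
        gate = torch.sigmoid(self.router(torch.cat([pixel_emb, instance_emb], 1)))
        return gate * pixel_emb + (1.0 - gate) * instance_emb
```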
May 2022
·
213 Reads
·
5 Citations
We present a simple yet effective fully convolutional one-stage 3D object detector for LiDAR point clouds of autonomous driving scenes, termed FCOS-LiDAR. Unlike the dominant methods that use the bird's-eye view (BEV), our proposed detector detects objects from the range view (RV, a.k.a. range image) of the LiDAR points. Due to the range view's compactness and compatibility with the sampling process of the LiDAR sensors on self-driving cars, the range view-based object detector can be realized by solely exploiting vanilla 2D convolutions, departing from BEV-based methods, which often involve complicated voxelization operations and sparse convolutions. For the first time, we show that an RV-based 3D detector with standard 2D convolutions alone can achieve comparable performance to state-of-the-art BEV-based detectors while being significantly faster and simpler. More importantly, almost all previous range view-based detectors only focus on single-frame point clouds, since it is challenging to fuse multi-frame point clouds into a single range view. In this work, we tackle this challenging issue with a novel range view projection mechanism, and for the first time demonstrate the benefits of fusing multi-frame point clouds for a range view-based detector. Extensive experiments on nuScenes show the superiority of our proposed method, and we believe that our work can serve as strong evidence that an RV-based 3D detector can compare favourably with the current mainstream BEV-based detectors.
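For intuition, a standard single-frame range-view projection can be sketched as below; the beam count, image width, and vertical field of view are illustrative defaults, not the settings used in the paper:

```python
import numpy as np

def project_to_range_view(points, h=32, w=1024, fov_up=10.0, fov_down=-30.0):
    """Bin each LiDAR point (x, y, z, ...) by azimuth and inclination into an
    H x W range image holding per-pixel depth; later points overwrite earlier
    ones (a real implementation would keep the nearest return)."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    depth = np.linalg.norm(points[:, :3], axis=1)
    yaw = np.arctan2(y, x)                                 # azimuth in [-pi, pi]
    pitch = np.arcsin(z / np.clip(depth, 1e-6, None))      # inclination
    u = (((yaw / np.pi) + 1.0) / 2.0 * w).astype(int) % w  # column index
    fov = np.radians(fov_up - fov_down)
    v = (np.radians(fov_up) - pitch) / fov * h
    v = np.clip(v, 0, h - 1).astype(int)                   # row index
    img = np.zeros((h, w), dtype=np.float32)
    img[v, u] = depth
    return img
```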
January 2022
·
30 Reads
·
64 Citations
IEEE Transactions on Pattern Analysis and Machine Intelligence
We propose a simple yet effective framework for instance and panoptic segmentation, termed CondInst (conditional convolutions for instance and panoptic segmentation). In the literature, top-performing instance segmentation methods typically follow the paradigm of Mask R-CNN and rely on ROI operations (typically ROIAlign) to attend to each instance. In contrast, we propose to attend to the instances with dynamic conditional convolutions. Instead of using instance-wise ROIs as inputs to an instance mask head of fixed weights, we design dynamic instance-aware mask heads, conditioned on the instances to be predicted. CondInst enjoys three advantages: 1) Instance and panoptic segmentation are unified into a fully convolutional network, eliminating the need for ROI cropping and feature alignment. 2) The elimination of the ROI cropping also significantly improves the output instance mask resolution. 3) Due to the much improved capacity of dynamically generated conditional convolutions, the mask head can be very compact (e.g., 3 conv. layers, each having only 8 channels), leading to significantly faster inference time per instance and making the overall inference time largely independent of the number of instances. We demonstrate a simpler method that can achieve improved accuracy and inference speed on both instance and panoptic segmentation tasks. On the COCO dataset, we outperform a few state-of-the-art methods. We hope that CondInst can be a strong baseline for instance and panoptic segmentation. Code is available at: https://git.io/AdelaiDet.
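The dynamically generated mask head described above (3 conv layers of 8 channels each) can be sketched as follows; the flat parameter layout and 1x1 kernels are assumptions consistent with the abstract, not a reproduction of the released code:

```python
import torch.nn.functional as F

def dynamic_mask_head(mask_feats, params):
    """Apply a tiny per-instance mask head whose weights are predicted by a
    controller network: three 1x1 conv layers with 8 channels each."""
    # mask_feats: (1, C_in, H, W) shared mask-branch features for one image
    # params: 1-D tensor holding this instance's conv weights and biases
    channels = [mask_feats.size(1), 8, 8, 1]
    x, idx = mask_feats, 0
    for i in range(3):
        c_in, c_out = channels[i], channels[i + 1]
        w = params[idx:idx + c_out * c_in].view(c_out, c_in, 1, 1)
        idx += c_out * c_in
        b = params[idx:idx + c_out]
        idx += c_out
        x = F.conv2d(x, w, b)
        if i < 2:
            x = F.relu(x)
    return x.sigmoid()   # (1, 1, H, W) soft instance mask
```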
January 2022
·
171 Reads
·
1 Citation
We propose a direct, regression-based approach to 2D human pose estimation from single images. We formulate the problem as a sequence prediction task, which we solve using a Transformer network. This network directly learns a regression mapping from images to the keypoint coordinates, without resorting to intermediate representations such as heatmaps. This approach avoids much of the complexity associated with heatmap-based approaches. To overcome the feature misalignment issues of previous regression-based methods, we propose an attention mechanism that adaptively attends to the features that are most relevant to the target keypoints, considerably improving the accuracy. Importantly, our framework is end-to-end differentiable, and naturally learns to exploit the dependencies between keypoints. Experiments on MS-COCO and MPII, two predominant pose-estimation datasets, demonstrate that our method significantly improves upon the state-of-the-art in regression-based pose estimation. More notably, ours is the first regression-based approach to perform favorably compared to the best heatmap-based pose estimation methods.
... In object detection, models such as DETR (Detection Transformer) have made the detection pipeline simpler by dispensing with the requirement for hand-engineered parts and have shown competitive performance [28]. ViTs have also been successfully used in semantic segmentation, enhancing the accuracy of pixel-level classification tasks [29]. In addition, the generalization of ViTs has expanded their use in the video context for video understanding, whereby ViTs process the sequential frames to capture the spatial and temporal information to achieve goals like action recognition [30]. ...
October 2023
International Journal of Computer Vision
... Recent human pose estimation methods [51,52,72] explore transformer-based architectures [2,4,77,95] due to their sparse, end-to-end design and promising performance. These methods treat human pose estimation as a direct set prediction problem and use bipartite matching to establish one-to-one instance correspondence during training. ...
November 2022
Lecture Notes in Computer Science
... The dot product in feature integration reduces the dimensionality of the output of ViT. Since the dot product is highly sensitive to the scale of the feature values, it can also merge multiple vectors into a scalar or a shorter vector, and the values with opposite signs will be canceled out, resulting in information loss [29,30]. The transformation of the integration strategy of feature vectors refined the model architecture. ...
October 2022
... CornerNet (Law and Deng, 2018) and ExtremeNet (Zhou et al., 2019b) use the keypoint-based detection technique, which identifies the target's upper-left and lower-right corner points and then combines the corner points to produce a detection frame. FSAF (Zhu et al., 2019), FCOS (Tian et al., 2022), ...
May 2022
... SOLO achieves an AP of 41.7, performing closely to the Baseline. Other methods such as HTC [32], PointRend, SOLOv2, Mask R-CNN, CondInst [33], and RTMDet [34] have AP values of 39.8, 37.5, 38.4, 31.7, ...
January 2022
IEEE Transactions on Pattern Analysis and Machine Intelligence
... learning-based methods. Human pose estimation methods can be divided into two categories: heatmap-based methods [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20], [21], [22], [23] and regression-based methods [24], [25], [26], [27], [28], [29], [30]. Compared with regression-based methods, heatmap-based methods can preserve spatial position information to provide richer supervision, resulting in a smoother training process and higher accuracy. ...
January 2022
... Some studies focus on a similar task, HBox-to-Mask: 1) SDI [69] refines the segmentation through an iterative training process; 2) BBTP [70] formulates HBox-supervised instance segmentation as a multiple-instance learning problem based on Mask R-CNN [71]; 3) BoxInst [72] uses the color-pairwise affinity with a box constraint under an efficient RoI-free CondInst [73]; 4) BoxLevelSet [74] introduces an energy function to predict the instance-aware mask as the level set; 5) SAM (Segment Anything Model) [43] produces object masks from input Point/HBox prompts. Though RBoxes can be obtained from the segmentation mask by finding the minimum circumscribed rectangle, we show that such a cascade pipeline can be less cost-efficient (see Sec. IV). ...
June 2021
... Furthermore, these two-stage approaches suffer from non-differentiable, hand-crafted post-processing steps that challenge optimization. Inspired by the one-stage object detectors [21,76], pixel-wise regression methods [49,53,55,64,73,75,78,81,93] densely predict pose candidates in an end-to-end fashion and apply Non-maximum Suppression (NMS) to obtain poses for different individuals. However, these methods produce redundant results, challenging the removal of duplicates. ...
June 2021
... To further validate the performance of the InSAR-YOLOv8 model, this study trained the Faster-R-CNN [50], RTMDet [73], Double-head-faster-RCNN [74], YOLOv3, Nas-fcos [75], and YOLOvX [76] models under identical conditions. Among them, Faster-R-CNN and YOLOv3 represent classic two-stage and one-stage detection models, respectively, and have been widely applied across various detection tasks. ...
December 2021
International Journal of Computer Vision
... Although convolutional neural networks dominate the aforementioned research, the superior performance of Transformers in natural language processing based on the self-attention mechanism has attracted significant attention from the computer vision community. The exceptional performance of Transformers [33] in natural language processing has led to the development of numerous Transformer-based methodologies [34]-[38] for tackling high-level vision tasks, including image classification [39,40], object detection [41,42], and segmentation [43,44]. Specifically, Liang et al. [45] introduced a novel Transformer model for image restoration tasks. ...
April 2021