Zhi Tian’s research while affiliated with University of Adelaide and other places


Publications (49)


Integrating instance-level knowledge to see the unseen: A two-stream network for video object segmentation
  • Article

May 2024 · 5 Reads · Neurocomputing

Zhi Tian · Pengxu Wei · [...] · Wangmeng Zuo

Comparison with previous methods in terms of performance and efficiency on the ADE20K dataset. The two bubble series in the accompanying graph represent the ViT Base and ViT Large models, respectively, with the size of each bubble corresponding to the FLOPs of the variant segmentation methods. SegViT-BEiT v2 Large achieves state-of-the-art performance with a 58.0% mIoU on the ADE20K validation set. Additionally, our efficient, optimized version, SegViT-Shrunk-BEiT v2 Large, saves half of the GFLOPs compared to UPerNet, significantly reducing computational overhead while maintaining a competitive performance of 55.7%.
The overall concept of our Attention-to-Mask decoder. ATM learns the similarity map for each category by capturing the cross-attention between the class tokens and the spatial feature map (Left). Sigmoid is applied to produce category-specific masks, highlighting the area with high similarity to the corresponding class (Middle). ATM enhances the semantic representations by encouraging the feature to be similar to the target class token and dissimilar to other tokens.


SegViT v2: Exploring Efficient and Continual Semantic Segmentation with Plain Vision Transformers
  • Article
  • Full-text available

October 2023 · 118 Reads · 38 Citations · International Journal of Computer Vision

This paper investigates the capability of plain Vision Transformers (ViTs) for semantic segmentation using the encoder-decoder framework and introduces SegViTv2. In this study, we introduce a novel Attention-to-Mask (ATM) module to design a lightweight decoder effective for plain ViT. The proposed ATM converts the global attention map into semantic masks for high-quality segmentation results. Our decoder outperforms the popular UPerNet decoder using various ViT backbones while consuming only about 5% of the computational cost. For the encoder, we address the concern of the relatively high computational cost in ViT-based encoders and propose a Shrunk++ structure that incorporates edge-aware query-based down-sampling (EQD) and query-based up-sampling (QU) modules. The Shrunk++ structure reduces the computational cost of the encoder by up to 50% while maintaining competitive performance. Furthermore, we propose to adapt SegViT for continual semantic segmentation, demonstrating nearly zero forgetting of previously learned knowledge. Experiments show that our proposed SegViTv2 surpasses recent segmentation methods on three popular benchmarks including the ADE20k, COCO-Stuff-10k and PASCAL-Context datasets. The code is available through the following link: https://github.com/zbwxp/SegVit.
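As a reading aid, here is a minimal sketch of the Attention-to-Mask idea summarized in the abstract above, assuming a PyTorch-style setup; the single-head attention, tensor shapes, and module names are illustrative simplifications, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class ATMSketch(nn.Module):
    """Attention-to-Mask sketch: learnable class tokens cross-attend to the
    spatial ViT tokens, and the same query-key similarities are squashed
    with a sigmoid to act as per-class masks."""
    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.class_tokens = nn.Parameter(torch.randn(num_classes, dim))  # learnable queries
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.cls_head = nn.Linear(dim, num_classes + 1)  # per-query class logits (illustrative)

    def forward(self, feats: torch.Tensor):
        # feats: (B, H*W, dim) spatial tokens from the ViT backbone
        B, N, C = feats.shape
        q = self.q_proj(self.class_tokens).expand(B, -1, -1)   # (B, num_classes, C)
        k = self.k_proj(feats)                                  # (B, N, C)
        sim = q @ k.transpose(1, 2) / C ** 0.5                  # (B, num_classes, N)
        masks = sim.sigmoid()                                   # category-specific masks
        attn = sim.softmax(dim=-1)                              # standard attention weights
        queries_out = attn @ feats                              # updated class queries
        return masks, self.cls_head(queries_out)
```

The key point is that no separate mask branch is needed: the similarity map that drives the cross-attention doubles as the segmentation mask, which is what keeps the decoder lightweight.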


Fig. 3 The overall SegViT structure with the ATM module. The Attention-to-Mask (ATM) module inherits the typical transformer decoder structure. It takes in randomly initialized class embeddings as queries and the feature maps from the ViT backbone to generate keys and values. The outputs of the ATM module are used as the input queries for the next layer. The ATM module is carried out sequentially with inputs from different layers of the backbone as keys and values in a cascade manner. A linear transform is then applied to the output of the ATM module to produce the class predictions for each token. The mask for the corresponding class is transferred from the similarities between queries and keys in the ATM module. We have removed the self-attention mechanism in the ATM decoder layers to further improve the efficiency while maintaining the performance.
Fig. 4 Architecture of the proposed query-downsampling (QD) layer (blue block) and the query-upsampling (QU) layer (pink block). The QD layer uses an efficient downsampling technique (green block) and removes less informative input tokens used for the query. The QU layer takes a set of trainable query tokens and learns to recover the discarded tokens using multi-head attention.
Fig. 5 Illustrations of the Shrunk and Shrunk++. In the diagram, the blue and orange boxes respectively refer to the transformer encoder block and the patch embedding block. In SegVit [68], the proposed Shrunk structure employs query downsampling (QD) on the middle-level features to preserve the information. In the new Shrunk++ architecture, we introduce the Edged Query Downsampling (EQD) technique which consolidates every four adjacent tokens into one token and additionally includes the tokens that contain edges. This enhancement enables downsampling operations to take place before the first layer without significant performance degradation, offering computational savings for the initial layers of the Shrunk model. The edge information is extracted using a lightweight parallel edge detection head.
Ablation results of different decoder methods with their corresponding feature merge types and loss types. ViT-Base is employed as the backbone for all the variants.
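A rough sketch of the edge-aware query downsampling (EQD) step described in Fig. 5 above, assuming an even token grid and a precomputed binary edge map; the 2x2 average-pooling merge and the helper name are assumptions for illustration, not the released code.

```python
import torch
import torch.nn.functional as F

def edged_query_downsample(tokens: torch.Tensor, edge_map: torch.Tensor):
    """Merge every 2x2 neighbourhood of tokens into one token, then append the
    full-resolution tokens that contain edges (per the Fig. 5 description).

    tokens:   (B, H, W, C) patch tokens, H and W assumed even
    edge_map: (B, H, W) binary map from a lightweight edge-detection head
    """
    B, H, W, C = tokens.shape
    # 4-to-1 token merge; average pooling is an assumption, any reduction fits the idea
    merged = F.avg_pool2d(tokens.permute(0, 3, 1, 2), kernel_size=2)   # (B, C, H/2, W/2)
    merged = merged.permute(0, 2, 3, 1).reshape(B, -1, C)              # (B, H*W/4, C)

    # additionally keep the original tokens that fall on edges
    flat_tokens = tokens.reshape(B, H * W, C)
    flat_edges = edge_map.reshape(B, H * W).bool()
    # the number of edge tokens differs per sample, so the result is a ragged list
    return [torch.cat([merged[b], flat_tokens[b][flat_edges[b]]], dim=0) for b in range(B)]
```

Because the merge can run before the first transformer layer, most of the encoder operates on roughly a quarter of the tokens, which is where the Shrunk++ savings come from; the appended edge tokens preserve boundary detail.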
SegViTv2: Exploring Efficient and Continual Semantic Segmentation with Plain Vision Transformers

June 2023 · 118 Reads

We explore the capability of plain Vision Transformers (ViTs) for semantic segmentation using the encoder-decoder framework and introduce SegViTv2. In our work, we implement the decoder with the global attention mechanism inherent in ViT backbones and propose the lightweight Attention-to-Mask module that effectively converts the global attention map into semantic masks for high-quality segmentation results. Our decoder can outperform the most commonly used UPerNet decoder with various ViT backbones while consuming only about 5% of the computational cost. For the encoder, we address the concern of the relatively high computational cost in ViT-based encoders and propose a Shrunk++ structure that incorporates edge-aware query-based down-sampling (EQD) and query-based up-sampling (QU) modules. The Shrunk++ structure reduces the computational cost of the encoder by up to 50% while maintaining competitive performance. Furthermore, due to the flexibility of our ViT-based architecture, SegViT can be easily extended to semantic segmentation under the continual learning setting, achieving nearly zero forgetting. Experiments show that our proposed SegViT outperforms recent segmentation methods on three popular benchmarks including the ADE20k, COCO-Stuff-10k and PASCAL-Context datasets. The code is available at: https://github.com/zbwxp/SegVit.


Figure 1. The overall network architecture of FastPillars. As shown at the top, taking the raw point cloud as input, FastPillars outputs object class, IoU, location offset, dimension and heading angle information, and finally outputs the predicted 3D bounding boxes. As shown at the bottom, FastPillars consists of four parts: the pillar encoding module, a backbone, a neck and heads. In detail, the point cloud is first pillarized (one yellow column represents a pillar), then all the columns are encoded, and the encoded features are sent to the CRVNet backbone for feature extraction. These features are fused by the neck, and finally, heads similar to CenterPoint are applied for 3D box regression. Our CRVNet provides two optional backbone network variants, the CSPRepVGG backbone for FastPillars-s and the CSPRep-Res34 backbone for FastPillars-m. Best viewed in color.
Figure 2. Our Max-and-Attention Pillar Encoding (MAPE) module architecture. It consists of three units: point encoding, max-pooling encoding and attentive-pooling encoding. The input of this module is the raw point cloud containing Cartesian coordinates, reflected intensity and relative timestamp. Here, we take a single pillar containing N points after pillarization as an example. In the point encoding unit, the raw points are first augmented with the pillar-center and point-cloud-range information, then the augmented point features are mapped to the feature space by an MLP. In the max-pooling encoding unit, pillar-wise features are obtained by a max-pooling operation across these point features. In the attentive-pooling encoding unit, pillar-wise features are obtained by a weighted summation over these point features. Finally, the current pillar feature is obtained by averaging the max-pooled and attentively-pooled features.
Figure 3. The architecture of CRVNet. CRVNet consists of CSP and RepVGG-style structures. Here we take the VGG network as an example. (a) A RepBlock is composed of a stack of RepVGG blocks with activation functions at training time. (b) At inference time, each RepVGG block is converted to a RepConv. (c) There are N RepConv modules in each RepVGG stage. (d) A CSPRep Block comprises three 1×1 convolutional layers and a stack of N RepConv sub-blocks followed by activation functions, with a residual connection.
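For context on the RepVGG-to-RepConv conversion mentioned in Figure 3(b), here is a compact sketch of the standard structural reparameterization, assuming the usual 3x3 + 1x1 + identity branch layout and that batch-norm has already been folded into the branch weights; it is a generic illustration rather than the CRVNet code.

```python
import torch
import torch.nn.functional as F

def fuse_repvgg_branches(k3, b3, k1, b1, channels):
    """Fuse a 3x3 branch, a 1x1 branch and an identity branch into one 3x3 conv.

    k3: (C, C, 3, 3) 3x3-branch weights, b3: (C,) its bias
    k1: (C, C, 1, 1) 1x1-branch weights, b1: (C,) its bias
    The identity branch requires equal input/output channels.
    """
    k1_as_3x3 = F.pad(k1, [1, 1, 1, 1])            # place the 1x1 kernel at the 3x3 centre

    k_id = torch.zeros(channels, channels, 3, 3)   # identity as a centred one-hot kernel
    for c in range(channels):
        k_id[c, c, 1, 1] = 1.0

    fused_kernel = k3 + k1_as_3x3 + k_id
    fused_bias = b3 + b1                           # the identity branch contributes no bias
    return fused_kernel, fused_bias

# At inference a single conv reproduces the three-branch training-time block:
# y = F.conv2d(x, fused_kernel, fused_bias, stride=1, padding=1)
```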
FastPillars: A Deployment-friendly Pillar-based 3D Detector

February 2023 · 564 Reads

The deployment of 3D detectors poses one of the major challenges in real-world self-driving scenarios. Existing BEV-based (i.e., Bird's Eye View) detectors favor sparse convolutions (known as SPConv) to speed up training and inference, which puts up a hard barrier for deployment, especially for on-device applications. In this paper, to tackle the challenge of efficient 3D object detection from an industry perspective, we devise a deployment-friendly pillar-based 3D detector, termed FastPillars. First, we introduce a novel lightweight Max-and-Attention Pillar Encoding (MAPE) module, specifically aimed at enhancing small 3D objects. Second, we propose a simple yet effective principle for designing a backbone in pillar-based 3D detection. We construct FastPillars based on these designs, achieving high performance and low latency without SPConv. Extensive experiments on two large-scale datasets demonstrate the effectiveness and efficiency of FastPillars for on-device 3D detection regarding both performance and speed. Specifically, FastPillars delivers state-of-the-art accuracy on the Waymo Open Dataset with a 1.8× speedup and a 3.8 mAPH/L2 improvement over CenterPoint (SPConv-based). Our code is publicly available at: https://github.com/StiphyJay/FastPillars.
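A minimal sketch of the MAPE module described in Figure 2 above, assuming pillars padded to a fixed number of points and omitting the point-augmentation details and padding masks; the module and variable names are illustrative.

```python
import torch
import torch.nn as nn

class MAPESketch(nn.Module):
    """Max-and-Attention Pillar Encoding: the pillar feature is the average of a
    max-pooled and an attention-weighted summary of the per-point features."""
    def __init__(self, in_dim: int, feat_dim: int):
        super().__init__()
        self.point_mlp = nn.Linear(in_dim, feat_dim)   # point encoding unit
        self.score = nn.Linear(feat_dim, 1)            # scores for attentive pooling

    def forward(self, points: torch.Tensor):
        # points: (P, N, in_dim) -- P pillars, each padded to N points
        feats = self.point_mlp(points)                 # (P, N, feat_dim)
        max_feat = feats.max(dim=1).values             # max-pooling encoding
        weights = self.score(feats).softmax(dim=1)     # (P, N, 1) attention weights
        attn_feat = (weights * feats).sum(dim=1)       # attentive-pooling encoding
        return 0.5 * (max_feat + attn_feat)            # average of the two summaries
```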


Poseur: Direct Human Pose Regression with Transformers

November 2022 · 48 Reads · 87 Citations · Lecture Notes in Computer Science

We propose a direct, regression-based approach to 2D human pose estimation from single images. We formulate the problem as a sequence prediction task, which we solve using a Transformer network. This network directly learns a regression mapping from images to the keypoint coordinates, without resorting to intermediate representations such as heatmaps. This approach avoids much of the complexity associated with heatmap-based approaches. To overcome the feature misalignment issues of previous regression-based methods, we propose an attention mechanism that adaptively attends to the features that are most relevant to the target keypoints, considerably improving the accuracy. Importantly, our framework is end-to-end differentiable, and naturally learns to exploit the dependencies between keypoints. Experiments on MS-COCO and MPII, two predominant pose-estimation datasets, demonstrate that our method significantly improves upon the state-of-the-art in regression-based pose estimation. More notably, ours is the first regression-based approach to perform favorably compared to the best heatmap-based pose estimation methods. Code is available at: https://github.com/aim-uofa/Poseur. Keywords: 2D human pose estimation · Keypoint detection · Transformer
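A bare-bones sketch of the regression formulation described above, assuming one learnable query per keypoint and a stock transformer decoder cross-attending to flattened image features; the module names and the plain (x, y) head are assumptions for illustration, not the actual Poseur architecture.

```python
import torch
import torch.nn as nn

class KeypointRegressorSketch(nn.Module):
    """Regress keypoint coordinates directly from image features, with no heatmaps."""
    def __init__(self, feat_dim: int = 256, num_keypoints: int = 17):
        super().__init__()
        self.queries = nn.Embedding(num_keypoints, feat_dim)    # one query per keypoint
        layer = nn.TransformerDecoderLayer(d_model=feat_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)
        self.coord_head = nn.Linear(feat_dim, 2)                 # (x, y) per keypoint

    def forward(self, image_tokens: torch.Tensor):
        # image_tokens: (B, H*W, feat_dim) flattened backbone features
        B = image_tokens.size(0)
        q = self.queries.weight.unsqueeze(0).expand(B, -1, -1)   # (B, K, feat_dim)
        decoded = self.decoder(tgt=q, memory=image_tokens)       # queries attend to the image
        return self.coord_head(decoded).sigmoid()                # normalized coordinates in [0, 1]
```

Because each keypoint's coordinates come out of its own query, the decoder's self-attention lets the model exploit dependencies between keypoints, which the abstract highlights as a benefit of the end-to-end formulation.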


SegViT: Semantic Segmentation with Plain Vision Transformers

October 2022 · 190 Reads · 2 Citations

We explore the capability of plain Vision Transformers (ViTs) for semantic segmentation and propose SegViT. Previous ViT-based segmentation networks usually learn a pixel-level representation from the output of the ViT. Differently, we make use of the fundamental component -- the attention mechanism -- to generate masks for semantic segmentation. Specifically, we propose the Attention-to-Mask (ATM) module, in which the similarity maps between a set of learnable class tokens and the spatial feature maps are transferred to the segmentation masks. Experiments show that our proposed SegViT using the ATM module outperforms its counterparts using the plain ViT backbone on the ADE20K dataset and achieves new state-of-the-art performance on the COCO-Stuff-10K and PASCAL-Context datasets. Furthermore, to reduce the computational cost of the ViT backbone, we propose query-based down-sampling (QD) and query-based up-sampling (QU) to build a Shrunk structure. With the proposed Shrunk structure, the model can save up to 40% of the computations while maintaining competitive performance.
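To make the QD/QU pairing above concrete, here is a small sketch of query-based up-sampling: a set of trainable query tokens attends to the shrunk token sequence to recover a longer one. The use of nn.MultiheadAttention and the fixed output length are assumptions for illustration.

```python
import torch
import torch.nn as nn

class QUSketch(nn.Module):
    """Query-based up-sampling: trainable queries recover tokens discarded by
    the downsampling step via multi-head attention."""
    def __init__(self, dim: int, num_out_tokens: int, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_out_tokens, dim))  # target-resolution tokens
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, shrunk_tokens: torch.Tensor):
        # shrunk_tokens: (B, N_small, dim) output of the downsampled encoder stages
        B = shrunk_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)                # (B, N_out, dim)
        recovered, _ = self.attn(query=q, key=shrunk_tokens, value=shrunk_tokens)
        return recovered                                               # (B, N_out, dim)
```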


Two-Stream Networks for Object Segmentation in Videos

August 2022 · 35 Reads

Existing matching-based approaches perform video object segmentation (VOS) via retrieving support features from a pixel-level memory, while some pixels may suffer from a lack of correspondence in the memory (i.e., unseen), which inevitably limits their segmentation performance. In this paper, we present a Two-Stream Network (TSN). Our TSN includes (i) a pixel stream with a conventional pixel-level memory, to segment the seen pixels based on their pixel-level memory retrieval; (ii) an instance stream for the unseen pixels, where a holistic understanding of the instance is obtained with dynamic segmentation heads conditioned on the features of the target instance; and (iii) a pixel division module generating a routing map, with which the output embeddings of the two streams are fused together. The compact instance stream effectively improves the segmentation accuracy of the unseen pixels, while fusing the two streams with the adaptive routing map leads to an overall performance boost. Through extensive experiments, we demonstrate the effectiveness of our proposed TSN, and we also report state-of-the-art performance of 86.1% on YouTube-VOS 2018 and 87.5% on the DAVIS-2017 validation split.
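A minimal sketch of how the routing map described above could fuse the two streams' embeddings; the 1x1-conv routing head and the soft per-pixel blend are assumptions for illustration, not the paper's exact pixel division module.

```python
import torch
import torch.nn as nn

class TwoStreamFusionSketch(nn.Module):
    """Fuse pixel-stream and instance-stream embeddings with a predicted routing map."""
    def __init__(self, channels: int):
        super().__init__()
        # routing head: inspects both embeddings and predicts a per-pixel weight in [0, 1]
        self.route = nn.Sequential(nn.Conv2d(2 * channels, 1, kernel_size=1), nn.Sigmoid())

    def forward(self, pixel_emb: torch.Tensor, inst_emb: torch.Tensor):
        # pixel_emb, inst_emb: (B, C, H, W)
        r = self.route(torch.cat([pixel_emb, inst_emb], dim=1))   # (B, 1, H, W) routing map
        # lean on the pixel stream where memory retrieval is reliable, otherwise on the instance stream
        return r * pixel_emb + (1.0 - r) * inst_emb
```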


Figure 3: Overall architecture. The overall architecture of FCOS-LiDAR resembles the 2D image-based detector FCOS [2]. Taking a range image as input, the network obtains the multi-level FPN features, and then the classification and regression branches are attached to these feature levels to predict the final 3D boxes. Different from FCOS, the weights of the detection heads are not shared between the FPN levels, as mentioned in Sec. 3.4. In addition, class-specific regression heads are used instead of the class-agnostic ones in FCOS.
Figure 4: Visualization results of FCOS-LiDAR on the nuScenes val. set.
Multi-round range view (MRV) projection. Time: the elapsed time of MRV.
Inference time breakdowns. We compare against the state-of-the-art BEV-based CenterPoint [8], which is trained with exactly the same strategies. FCOS-LiDAR is significantly faster as well as competitive in the multi-frame setting (and superior in the single-frame setting).
Fully Convolutional One-Stage 3D Object Detection on LiDAR Range Images

May 2022 · 213 Reads · 5 Citations

We present a simple yet effective fully convolutional one-stage 3D object detector for LiDAR point clouds of autonomous driving scenes, termed FCOS-LiDAR. Unlike the dominant methods that use the bird-eye view (BEV), our proposed detector detects objects from the range view (RV, a.k.a. range image) of the LiDAR points. Due to the range view's compactness and compatibility with the LiDAR sensors' sampling process on self-driving cars, the range view-based object detector can be realized by solely exploiting the vanilla 2D convolutions, departing from the BEV-based methods which often involve complicated voxelization operations and sparse convolutions. For the first time, we show that an RV-based 3D detector with standard 2D convolutions alone can achieve comparable performance to state-of-the-art BEV-based detectors while being significantly faster and simpler. More importantly, almost all previous range view-based detectors only focus on single-frame point clouds, since it is challenging to fuse multi-frame point clouds into a single range view. In this work, we tackle this challenging issue with a novel range view projection mechanism, and for the first time demonstrate the benefits of fusing multi-frame point clouds for a range-view based detector. Extensive experiments on nuScenes show the superiority of our proposed method and we believe that our work can be strong evidence that an RV-based 3D detector can compare favourably with the current mainstream BEV-based detectors.
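For readers unfamiliar with the range-view representation the abstract refers to, here is a rough numpy sketch of projecting a LiDAR sweep into a range image; the image size, field-of-view bounds, and equal-angle binning are assumptions for illustration and do not reproduce the paper's multi-round projection for multi-frame fusion.

```python
import numpy as np

def project_to_range_image(points: np.ndarray, H: int = 32, W: int = 1024,
                           fov_up: float = 10.0, fov_down: float = -30.0) -> np.ndarray:
    """Project (N, 4) LiDAR points [x, y, z, intensity] onto an H x W range image.

    Each pixel stores (range, intensity); when several points fall into the same
    pixel, the nearest one wins.
    """
    x, y, z, intensity = points[:, 0], points[:, 1], points[:, 2], points[:, 3]
    r = np.sqrt(x ** 2 + y ** 2 + z ** 2)                          # range per point

    yaw = np.arctan2(y, x)                                         # azimuth in [-pi, pi]
    pitch = np.arcsin(z / np.clip(r, 1e-6, None))                  # inclination

    u = ((np.pi - yaw) / (2 * np.pi) * W).astype(np.int32)         # column index
    fov = np.radians(fov_up) - np.radians(fov_down)
    v = ((np.radians(fov_up) - pitch) / fov * H).astype(np.int32)  # row index

    image = np.zeros((H, W, 2), dtype=np.float32)
    order = np.argsort(-r)                                         # write far points first, near last
    u, v, r, intensity = u[order], v[order], r[order], intensity[order]
    valid = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    image[v[valid], u[valid], 0] = r[valid]
    image[v[valid], u[valid], 1] = intensity[valid]
    return image
```

Because the result is a dense 2D image, the detector on top can use plain 2D convolutions, which is the compactness and sensor-compatibility argument the abstract makes against voxelization and sparse convolutions.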


Instance and Panoptic Segmentation Using Conditional Convolutions

January 2022 · 30 Reads · 64 Citations · IEEE Transactions on Pattern Analysis and Machine Intelligence

We propose a simple yet effective framework for instance and panoptic segmentation, termed CondInst (conditional convolutions for instance and panoptic segmentation). In the literature, top-performing instance segmentation methods typically follow the paradigm of Mask R-CNN and rely on ROI operations (typically ROIAlign) to attend to each instance. In contrast, we propose to attend to the instances with dynamic conditional convolutions. Instead of using instance-wise ROIs as inputs to the instance mask head of fixed weights, we design dynamic instance-aware mask heads, conditioned on the instances to be predicted. CondInst enjoys three advantages: 1) Instance and panoptic segmentation are unified into a fully convolutional network, eliminating the need for ROI cropping and feature alignment. 2) The elimination of the ROI cropping also significantly improves the output instance mask resolution. 3) Due to the much improved capacity of dynamically-generated conditional convolutions, the mask head can be very compact (e.g., 3 conv. layers, each having only 8 channels), leading to significantly faster inference time per instance and making the overall inference time less relevant to the number of instances. We demonstrate a simpler method that can achieve improved accuracy and inference speed on both instance and panoptic segmentation tasks. On the COCO dataset, we outperform a few state-of-the-art methods. We hope that CondInst can be a strong baseline for instance and panoptic segmentation. Code is available at: https://git.io/AdelaiDet .
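A condensed sketch of the dynamic, instance-conditioned mask head the abstract describes (three conv layers with 8 channels whose weights are generated per instance); splitting a flat parameter vector and applying it with F.conv2d is one common way to realize this, shown here under assumed shapes (1x1 kernels, a single instance) rather than as the released CondInst code.

```python
import torch
import torch.nn.functional as F

def dynamic_mask_head(feats: torch.Tensor, params: torch.Tensor,
                      in_ch: int, mid_ch: int = 8) -> torch.Tensor:
    """Apply a tiny instance-specific mask head (three 1x1 conv layers).

    feats:  (1, in_ch, H, W) mask-branch features for one image
    params: flat per-instance weight/bias vector produced by the controller head
    """
    sizes = [(mid_ch * in_ch, mid_ch),    # layer 1: in_ch -> mid_ch
             (mid_ch * mid_ch, mid_ch),   # layer 2: mid_ch -> mid_ch
             (1 * mid_ch, 1)]             # layer 3: mid_ch -> 1 (the mask logit)
    shapes = [(mid_ch, in_ch, 1, 1), (mid_ch, mid_ch, 1, 1), (1, mid_ch, 1, 1)]

    x, idx = feats, 0
    for (w_n, b_n), w_shape in zip(sizes, shapes):
        w = params[idx: idx + w_n].view(w_shape); idx += w_n
        b = params[idx: idx + b_n]; idx += b_n
        x = F.conv2d(x, w, b)
        if w_shape[0] != 1:               # ReLU on the hidden layers only
            x = F.relu(x)
    return x.sigmoid()                    # (1, 1, H, W) instance mask
```

Because each per-instance head is this small, running it once per detected instance is cheap, which is why the abstract notes that overall inference time becomes largely insensitive to the number of instances.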


Figure 1. Comparing the proposed Poseur against heatmap-based methods with various backbone networks on the COCO val. set. Baseline refers to heatmap-based methods. The heatmap-based baselines with MobileNet-V2 and ResNet use the same deconvolutional head as SimpleBaseline [36].
The effect of uncertainty estimation.
Comparison with different scale levels of the backbone on the COCO val set. Res_i: the i-th level feature map of ResNet.
Ablation study of different numbers of decoder layers on the COCO val set. N_d is the number of decoder layers.
Comparisons with state-of-the-art methods on the COCO val set. Input size and the GFLOPs are shown for the single person pose estimation methods. SimBa: SimpleBaseline [36]. Unless specified, the number of decoder layers is set to 6.
Poseur: Direct Human Pose Regression with Transformers

January 2022 · 171 Reads · 1 Citation

We propose a direct, regression-based approach to 2D human pose estimation from single images. We formulate the problem as a sequence prediction task, which we solve using a Transformer network. This network directly learns a regression mapping from images to the keypoint coordinates, without resorting to intermediate representations such as heatmaps. This approach avoids much of the complexity associated with heatmap-based approaches. To overcome the feature misalignment issues of previous regression-based methods, we propose an attention mechanism that adaptively attends to the features that are most relevant to the target keypoints, considerably improving the accuracy. Importantly, our framework is end-to-end differentiable, and naturally learns to exploit the dependencies between keypoints. Experiments on MS-COCO and MPII, two predominant pose-estimation datasets, demonstrate that our method significantly improves upon the state-of-the-art in regression-based pose estimation. More notably, ours is the first regression-based approach to perform favorably compared to the best heatmap-based pose estimation methods.


Citations (27)


... In object detection, models such as DETR (Detection Transformer) have made the detection pipeline simpler by dispensing with the requirement for hand-engineered parts and have shown competitive performance [28]. ViTs have also been successfully used in semantic segmentation, enhancing the accuracy of pixel-level classification tasks [29]. In addition, the generalization of ViTs has expanded their use in the video context for video understanding, whereby ViTs process the sequential frames to capture the spatial and temporal information to achieve goals like action recognition [30]. ...

Reference:

A Framework for Integrating Vision Transformers with Digital Twins in Industry 5.0 Context
SegViT v2: Exploring Efficient and Continual Semantic Segmentation with Plain Vision Transformers

International Journal of Computer Vision

... Recent human pose estimation methods [51,52,72] explore transformer-based architectures [2,4,77,95] due to their sparse, end-to-end design and promising performance. These methods treat human pose estimation as a direct set prediction problem and use bipartite matching to establish one-to-one instance correspondence during training. ...

Poseur: Direct Human Pose Regression with Transformers
  • Citing Chapter
  • November 2022

Lecture Notes in Computer Science

... The dot product in feature integration reduces the dimensionality of the output of ViT. Since the dot product is highly sensitive to the scale of the feature values, it can also merge multiple vectors into a scalar or a shorter vector, and the values with opposite signs will be canceled out, resulting in information loss [29,30]. The transformation of the integration strategy of feature vectors refined the model architecture. ...

SegViT: Semantic Segmentation with Plain Vision Transformers

... CornerNet (Law and Deng, 2018) and ExtremeNet (Zhou et al., 2019b) use the keypoint-based detection technique, which identifies the target's upper-left and lower-right corner points and then combines the corner points to produce a detection frame. FSAF (Zhu et al., 2019), FCOS (Tian et al., 2022), ...

Fully Convolutional One-Stage 3D Object Detection on LiDAR Range Images

... SOLO achieves an AP of 41.7, performing closely to the Baseline. Other methods such as HTC [32], PointRend, SOLOv2, Mask R-CNN, condinst [33], and rtmdet [34] have AP values of 39.8, 37.5, 38.4, 31.7, ...

Instance and Panoptic Segmentation Using Conditional Convolutions
  • Citing Article
  • January 2022

IEEE Transactions on Pattern Analysis and Machine Intelligence

... learning-based methods. Human pose estimation methods can be divided into two categories: heatmap-based methods [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20], [21], [22], [23] and regression-based methods [24], [25], [26], [27], [28], [29], [30]. Compared with regression-based methods, heatmap-based methods can preserve spatial position information to provide richer supervision, resulting in a smoother training process and higher accuracy. ...

Poseur: Direct Human Pose Regression with Transformers

... Some studies focus on a similar task, HBox-to-Mask: 1) SDI [69] refines the segmentation through an iterative training process; 2) BBTP [70] formulates the HBox-supervised instance segmentation into a multiple-instance learning problem based on Mask R-CNN [71]; 3) BoxInst [72] uses the color-pairwise affinity with box constraint under an efficient RoI-free CondInst [73]; 4) BoxLevelSet [74] introduces an energy function to predict the instance-aware mask as the level set; 5) SAM (Segment Anything Model) [43] produces object masks from input Point/HBox prompts. Though RBoxes can be obtained from the segmentation mask by finding the minimum circumscribed rectangle, we show that such a cascade pipeline can be less cost-efficient (see Sec. IV). ...

BoxInst: High-Performance Instance Segmentation with Box Annotations
  • Citing Conference Paper
  • June 2021

... Furthermore, these two-stage approaches suffer from non-differentiable, hand-crafted post-processing steps that challenge optimization. Inspired by the one-stage object detectors [21,76], pixel-wise regression methods [49,53,55,64,73,75,78,81,93] densely predict pose candidates in an end-to-end fashion and apply Non-maximum Suppression (NMS) to obtain poses for different individuals. However, these methods produce redundant results, challenging the removal of duplicates. ...

FCPose: Fully Convolutional Multi-Person Pose Estimation with Dynamic Instance-Aware Convolutions
  • Citing Conference Paper
  • June 2021

... To further validate the performance of the InSAR-YOLOv8 model, this study trained the Faster-R-CNN 50 , RTMDet 73 , Double-head-faster-RCNN 74 , YOLOv3, Nas-fcos 75 , and YOLOvX 76 models under the identical conditions. Among them, Faster-R-CNN and YOLOv3 represent classic two-stage and one-stage detection models, respectively, and have been widely applied across various detection tasks. ...

NAS-FCOS: Efficient Search for Object Detection Architectures

International Journal of Computer Vision

... Although convolutional neural networks dominate the aforementioned research, the superior performance of Transformers in natural language processing based on the self-attention mechanism has attracted significant attention from the computer vision community. The exceptional performance of Transformers 33 in natural language processing has led to the development of numerous Transformerbased methodologies [34][35][36][37][38] for tackling high-level vision tasks, including image classification, 39,40 object detection, 41,42 and segmentation. 43,44 Specifically, Liang et al. 45 introduced a novel Transformer model for image restoration tasks. ...

Twins: Revisiting Spatial Attention Design in Vision Transformers