Jilai Zheng’s research while affiliated with Shanghai Jiao Tong University and other places


Publications (11)


SparseFormer: Detecting Objects in HRW Shots via Sparse Vision Transformer
  • Preprint
  • File available

February 2025 · 2 Reads

Jilai Zheng · [...] · Xiaokang Yang
Recent years have seen an increase in the use of gigapixel-level image and video capture systems and benchmarks with high-resolution wide (HRW) shots. However, unlike close-up shots in the MS COCO dataset, the higher resolution and wider field of view raise unique challenges, such as extreme sparsity and huge scale changes, making existing close-up detectors inaccurate and inefficient. In this paper, we present a novel model-agnostic sparse vision transformer, dubbed SparseFormer, to bridge the gap in object detection between close-up and HRW shots. The proposed SparseFormer selectively uses attentive tokens to scrutinize the sparsely distributed windows that may contain objects. In this way, it can jointly explore global and local attention by fusing coarse- and fine-grained features to handle huge scale changes. SparseFormer also benefits from a novel cross-slice non-maximum suppression (C-NMS) algorithm to precisely localize objects from noisy windows, and from a simple yet effective multi-scale strategy to improve accuracy. Extensive experiments on two HRW benchmarks, PANDA and DOTA-v1.0, demonstrate that the proposed SparseFormer significantly improves detection accuracy (by up to 5.8%) and speed (by up to 3x) over state-of-the-art approaches.
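To make the two ideas highlighted in the abstract more concrete, the sketch below shows (i) keeping only a small fraction of "attentive" windows based on per-window scores and (ii) a plain greedy NMS over detections pooled from different slices, standing in for the paper's C-NMS. This is an illustrative sketch under assumed inputs; the function names, keep ratio, and IoU threshold are placeholders, not the authors' released implementation.

```python
# Illustrative sketch only; window scores, keep_ratio, and iou_thr are assumptions.
import torch

def select_sparse_windows(window_scores: torch.Tensor, keep_ratio: float = 0.2):
    """Keep only the most 'attentive' windows of an HRW frame.

    window_scores: (N,) objectness-style score per window, e.g. from a
    lightweight scoring head over coarse features.
    """
    k = max(1, int(keep_ratio * window_scores.numel()))
    return torch.topk(window_scores, k).indices  # indices of windows to process


def cross_slice_nms(boxes: torch.Tensor, scores: torch.Tensor, iou_thr: float = 0.5):
    """Merge detections coming from different (possibly overlapping) slices.

    boxes: (M, 4) in global image coordinates (x1, y1, x2, y2); scores: (M,).
    A plain greedy NMS over the pooled detections stands in for the paper's C-NMS.
    """
    order = scores.argsort(descending=True)
    keep = []
    while order.numel() > 0:
        i = order[0]
        keep.append(i.item())
        if order.numel() == 1:
            break
        rest = order[1:]
        # pairwise IoU between the best box and the remaining boxes
        x1 = torch.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = torch.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = torch.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = torch.minimum(boxes[i, 3], boxes[rest, 3])
        inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thr]  # drop boxes that overlap the kept one
    return torch.tensor(keep)
```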


VEON: Vocabulary-Enhanced Occupancy Prediction

July 2024 · 3 Reads

Perceiving the world as 3D occupancy supports embodied agents in avoiding collision with any type of obstacle. While open-vocabulary image understanding has prospered recently, how to bind the predicted 3D occupancy grids with open-world semantics still remains under-explored due to limited open-world annotations. Hence, instead of building our model from scratch, we try to blend 2D foundation models, specifically the depth model MiDaS and the semantic model CLIP, to lift the semantics into 3D space, thus fulfilling 3D occupancy prediction. However, building upon these foundation models is not trivial. First, MiDaS faces the depth ambiguity problem, i.e., it only produces relative depth and fails to estimate the bin depth required for feature lifting. Second, the CLIP image features lack high-resolution pixel-level information, which limits 3D occupancy accuracy. Third, open vocabulary is often trapped by the long-tail problem. To address these issues, we propose VEON for Vocabulary-Enhanced Occupancy predictioN by not only assembling but also adapting these foundation models. We first equip MiDaS with a ZoeDepth head and low-rank adaptation (LoRA) for relative-metric-bin depth transformation while preserving the beneficial depth prior. Then, a lightweight side adaptor network is attached to the CLIP vision encoder to generate high-resolution features for fine-grained 3D occupancy prediction. Moreover, we design a class reweighting strategy to give priority to the tail classes. With only 46M trainable parameters and zero manual semantic labels, VEON achieves 15.14 mIoU on Occ3D-nuScenes and shows the capability of recognizing objects with open-vocabulary categories, meaning that our VEON is label-efficient, parameter-efficient, and sufficiently precise.
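As one concrete illustration of the long-tail handling the abstract mentions, the sketch below reweights a cross-entropy occupancy loss with inverse-frequency class weights so that tail classes receive larger gradients. The function names, smoothing exponent, and exact formula are assumptions for illustration, not VEON's actual reweighting scheme.

```python
# Illustrative sketch only; the inverse-frequency formula and gamma are assumptions.
import torch
import torch.nn.functional as F

def inverse_frequency_weights(class_counts: torch.Tensor, gamma: float = 0.5):
    """Give rare (tail) classes larger loss weights.

    class_counts: (C,) number of voxels observed per class.
    gamma: softening exponent; 1.0 is plain inverse frequency.
    """
    freqs = class_counts.float() / class_counts.float().sum()
    weights = (1.0 / freqs.clamp(min=1e-8)) ** gamma
    return weights / weights.mean()  # normalize so the average weight is 1

def reweighted_occupancy_loss(logits: torch.Tensor, targets: torch.Tensor,
                              class_counts: torch.Tensor):
    """Cross-entropy over per-voxel class logits with per-class weights.

    logits: (N, C) per-voxel class scores; targets: (N,) class indices.
    """
    w = inverse_frequency_weights(class_counts).to(logits.device)
    return F.cross_entropy(logits, targets, weight=w)
```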




Sparsely-Supervised Object Tracking

May 2024 · 4 Reads · 1 Citation

IEEE Transactions on Image Processing

Recent years have witnessed an incredible performance boost in data-driven deep visual object trackers. Despite this success, these trackers require millions of sequential manual labels on videos for supervised training, imposing a heavy annotation burden. This raises a crucial question: how can we train a powerful tracker from abundant videos using limited manual annotations? In this paper, we challenge the conventional belief that frame-by-frame labeling is indispensable, and show that providing a small number of annotated bounding boxes in each video is sufficient for training a strong tracker. To facilitate this, we design a novel SParsely-supervised Object Tracking (SPOT) framework. It regards the sparsely annotated boxes as anchors and progressively explores the temporal span to discover unlabeled target snapshots. Under the teacher-student paradigm, SPOT leverages the transitive consistency inherent in the tracking task as supervision, extracting knowledge from both anchor snapshots and unlabeled target snapshots. We also employ several effective training strategies, i.e., IoU filtering, asymmetric augmentation, and temporal calibration, to further improve the training robustness of SPOT. The experimental results demonstrate that, given fewer than 5 labels per video, trackers trained via SPOT perform on par with their fully-supervised counterparts. Moreover, SPOT exhibits two desirable properties: 1) it enables us to fully exploit large-scale video datasets by efficiently allocating sparse labels to more videos even under a limited labeling budget; 2) when equipped with a target discovery module, it can even learn from purely unlabeled videos for further performance gains. We hope this work inspires the community to rethink current annotation principles and take a step towards practical label-efficient deep tracking.
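To illustrate one of the training strategies listed in the abstract, the sketch below shows IoU-based filtering of pseudo-labels: a pseudo-box on an unlabeled frame is kept only when two independent predictions agree. The interface and the 0.6 threshold are assumptions for illustration rather than the exact SPOT recipe.

```python
# Illustrative sketch only; the two-pass agreement check and threshold are assumptions.
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def filter_pseudo_labels(teacher_boxes, check_boxes, thr=0.6):
    """Keep a pseudo-label only when two independent predictions agree.

    teacher_boxes / check_boxes: lists of boxes for the same unlabeled frames,
    e.g. from the teacher and from a second (augmented or reverse-time) pass.
    Returns (frame_index, box) pairs that pass the agreement check.
    """
    return [
        (idx, t)
        for idx, (t, c) in enumerate(zip(teacher_boxes, check_boxes))
        if iou(t, c) >= thr
    ]
```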



Citations (7)


... LiDAR provides accurate 3D positional information, enabling accurate occupancy prediction, while cameras offer rich semantic details for object classification. Previous studies [1]- [3] have focused on integrating these two modalities by using 3D voxel representations generated from LiDAR data. To fuse camera features with 3D LiDAR features, these approaches predict depth from images, transform them from 2D to 3D views to create voxel-based image features, and then align these features with the LiDAR data in the voxel space. ...

Reference:

MR-Occ: Efficient Camera-LiDAR 3D Semantic Occupancy Prediction Using Hierarchical Multi-Resolution Voxel Representation
OccGen: Generative Multi-modal 3D Occupancy Prediction for Autonomous Driving
  • Citing Chapter
  • November 2024

... Existing occupancy prediction approaches [3,9,11,23,28,35,44] primarily rely on supervised learning, which typically requires dense 3D annotations obtained through labor-intensive manual labeling of dynamic driving scenes spanning up to 80 meters per frame. To mitigate this cost, recent studies have resorted to self-supervised alternatives [2,7,8,10,13,31,42,45]. These methods leverage 2D predictions from vision foundation models (VLMs) to train a 3D occupancy network, enforcing image reprojection consistency through volume rendering [10,31,42] or differentiable rasterization [7]. ...

VEON: Vocabulary-Enhanced Occupancy Prediction
  • Citing Chapter
  • October 2024

... Backbone: The backbone of Elastic-DETR, inspired by advances in SparseViT and SparseFormer (Li et al. 2024a), implements a sparsity strategy that selects windows of varying importance to economize on feature extraction computations. Notably, while the computational load is reduced during the feature extraction phase, the final feature map retains the same shape as those produced by denser models like Swin Transformer, ensuring compatibility with subsequent processing stages. ...

SparseFormer: Detecting Objects in HRW Shots via Sparse Vision Transformer
  • Citing Conference Paper
  • October 2024

... Fully supervised occupancy methods predict voxel-level semantics using dense voxel grids [9,17,34], depth priors [12,18], or sparse representations [11,21,27]. Despite their effectiveness, these approaches rely heavily on costly large-scale 3D annotations. ...

SparseOcc: Rethinking Sparse Latent Representation for Vision-Based Semantic Occupancy Prediction
  • Citing Conference Paper
  • June 2024

... To evaluate the effectiveness and robustness of our method, we compare it with six state-of-the-art trackers, including SDSTrack [5], Un-Track [44], VIPT [10], TBSI [8], BAT [4], and IIMF [12], on the LasHeR [9] dataset, as illustrated in Fig. 8. Specifically, in sequence (a), during the total occlusion (TO) challenge at frame 250, where the target is fully occluded by the basketball hoop, our method accurately tracks the target without losing its position. At frames 600 and 1100, significant distractions with similar appearances (SA) cause other trackers to fail, while ours maintains precise localization. ...

Single-Model and Any-Modality for Video Object Tracking
  • Citing Conference Paper
  • June 2024

... Meanwhile, unsupervised learning could leverage the availability of unlabeled data to train the models, significantly reducing the requirement for large labeled datasets. Hence, some researchers have begun to explore unsupervised methods, such as ResPUL [444], S2Siam [445], LUDT [446], UDT [447], CycleSiam [448], USOT [449], and ULAST [450]. Unsupervised learning strategies provide more possibilities in practical or specific applications where the training data is not annotated. ...

Learning to Track Objects from Unlabeled Videos
  • Citing Conference Paper
  • October 2021

... Based on the fact that a robust tracker should be effective in forward-backward tracking, UDT [19] trained a tracker with the cycle consistency loss between the forward and backward tracking results. These unsupervised trackers [19,21,22,39] showed that a cycle consistency loss after forward-backward tracking of multiple frames can effectively train the tracker. S²SiamFC [18] proposed a self-supervised training method to mine the spatial information in single frames. ...

Learning to Track Objects from Unlabeled Videos
  • Citing Preprint
  • August 2021
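
The forward-backward cycle consistency described in the last excerpt above can be pictured with the short sketch below: track forward through a clip, track back to the first frame, and penalize any drift from the starting box. The single-step `tracker(frame, box)` interface and the L1 drift penalty are assumptions for illustration, not UDT's actual loss.

```python
# Illustrative sketch only; the tracker interface and L1 penalty are assumptions.
import torch

def cycle_consistency_loss(tracker, frames, init_box):
    """Track forward through `frames`, then backward, and penalize drift.

    tracker(frame, box) -> box is an assumed single-step tracking interface;
    init_box is an (x1, y1, x2, y2) tensor for the first frame.
    """
    box = init_box
    for frame in frames[1:]:             # forward pass to the last frame
        box = tracker(frame, box)
    for frame in reversed(frames[:-1]):  # backward pass to the first frame
        box = tracker(frame, box)
    return torch.abs(box - init_box).mean()  # ~0 for a consistent tracker
```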