HeightFormer: A Semantic Alignment Monocular 3D Object Detection Method
from Roadside Perspective
Pei Liu1*, Zihao Zhang2*, Haipeng Liu3*, Nanfang Zheng4, Yiqun Li4, Meixin Zhu1, Ziyuan Pu4
1Intelligent Transportation Thrust, Systems Hub, The Hong Kong University of Science and Technology (Guangzhou),
Guangzhou, China
2School of Cyber Science and Engineering, Southeast University, Nanjing, 211189, China
3Li Auto Inc, Jiading District, Shanghai, 201800, China
4School of Transportation, Southeast University, Nanjing, 211189, China
*These authors contributed equally.
Abstract
On-board 3D object detection has received extensive attention as a critical technology for autonomous driving, while few studies have focused on applying roadside sensors to 3D traffic object detection. Existing studies achieve the projection of 2D image features to 3D features through height estimation based on the frustum. However, they do not consider height alignment or the extraction efficiency of bird’s-eye-view features. We propose a novel 3D object detection framework integrating Spatial Former and Voxel Pooling Former to enhance 2D-to-3D projection based on height estimation. Extensive experiments were conducted on the Rope3D and DAIR-V2X-I datasets, and the results demonstrate that the proposed algorithm outperforms existing methods in the detection of both vehicles and cyclists. These results indicate that the algorithm is robust and generalizes well across various detection scenarios. Improving the accuracy of
3D object detection on the roadside is conducive to build-
ing a safe and trustworthy intelligent transportation system of
vehicle-road coordination and promoting the large-scale ap-
plication of autonomous driving. The code and pre-trained
models will be released on https://anonymous.4open.science/
r/HeightFormer.
Introduction
Autonomous driving technology is developing rapidly as a
new transportation technology paradigm for reducing traffic
accidents and improving traffic efficiency. Perception tech-
nology is one of the most important technologies for au-
tonomous driving. Autonomous vehicles obtain information about the surrounding environment through sensors in order to make decisions and take actions (Van Brummelen et al. 2018). According to the types of sensors adopted, percep-
tion technology can be divided into three types: point-cloud-
based perception, vision-based perception, and fusion per-
ception. Due to the high cost of lidar, many scholars believe
that vision-based perception is the main research direction
for promoting the mass production of autonomous vehicles
in the future (Ma et al. 2024; Muhammad et al. 2022). How-
ever, on-board cameras are limited in their perception range due to the restricted installation height and are prone to occlusion by surrounding vehicles, especially trucks or
buses. The blind spots may cause serious crashes. To fill the
gap, many researchers have begun to focus on the perfor-
mance of roadside cameras, expecting to provide accurate
and reliable perception results for autonomous vehicles (Ye
et al. 2023; Hao et al. 2024).
Roadside perception, due to the higher installation posi-
tion of sensors, has a broader perception range and fewer
blind spots compared to the perception of the ego vehicle.
Owing to their fixed and elevated installation positions, roadside cameras hold clear advantages for the 3D object detection task. In the 3D object detection task, it is not
only necessary to classify objects but also to obtain informa-
tion such as the position, size, orientation, and speed of the
objects in the 3D space (Qian, Lai, and Li 2022). Therefore,
many researchers have carried out research on the monoc-
ular 3D object detection task from the roadside perspective
(Yang et al. 2024; Jinrang, Li, and Shi 2023). However, due
to the inconsistent types and installation methods of cam-
eras on the roadside, there may be different focal lengths
and pitch angles, and their perspectives are no longer paral-
lel to the ground. The traditional perception coordinate sys-
tem based on vehicles is no longer applicable to roadside
perception equipment. These problems have brought many
challenges to the roadside monocular 3D object detection
task.
At present, the mainstream vision-based 3D object detection methods include those leveraging the attention mechanism and those employing frustum projection. The method
based on the attention mechanism mainly achieves detec-
tion by performing regression prediction on 3D detection
boxes (Li et al. 2022). However, since cameras do not pro-
vide image depth information, the perception accuracy of
this method in 3D detection is relatively low. The frustum-
based method estimates the height or depth of objects in 2D
images, constructs 3D objects through projection, and gener-
ates BEV (bird’s eye view) features through voxel pooling to
achieve object detection (Yang et al. 2023c; Li et al. 2023).
Because the frustum-based method achieves higher accuracy and height-based projection improves the robustness of the algorithm, we carry out the 3D object detection task based on the idea of frustum projection.
Roadside perception can provide self-driving vehicles
with more extensive and precise environmental information
in the future. However, this first requires addressing the noise caused by different camera parameters and overcoming the decline in object perception ability due to different pitch angles of installed cameras, that is, improving the robustness, accuracy, and credibility of roadside perception. To address these problems, we leveraged the height projection method of the frustum and conducted extensive experiments with the Rope3D (Ye et al. 2022) and DAIR-V2X-I (Yu et al. 2022) datasets. Compared with the
previous methods, our method has achieved some improve-
ments in the robustness and accuracy of the algorithm. The
main contributions of this paper are summarized as follows:
• We proposed a monocular 3D object detection method from the roadside perspective. To address the spatial inconsistency that arises when fusing the height feature with the context feature, a deformable multi-scale spatial cross-attention (DMSC) module is introduced to spatially align the height and context features and improve the robustness of the algorithm.
• We added a self-attention mechanism to the BEV feature extraction process to address its low extraction accuracy. This module takes global background information into account and, through dynamic weight adjustment, increases the efficiency of information extraction and improves robustness during BEV feature extraction.
• We conducted extensive experiments to verify the accu-
racy of the proposed algorithm on the popular roadside
perception benchmarks, Rope3D (Ye et al. 2022) and
DAIR-V2X-I (Yu et al. 2022). In terms of the detection
of Cars and Big-vehicles in Rope3D, compared with the
state-of-the-art (SOTA) algorithm BEVHeight++ (Yang
et al. 2023b), the accuracy has increased from 76.12% to
78.49% and from 50.11% to 60.69% respectively when
the intersection over union (IoU) is 0.5. Meanwhile, by
comparing the detection results under different difficul-
ties, it has been verified that our method is robust.
Related Works
Vision-based 3D Detection for Autonomous Driving
In recent years, the methods of vision-based 3D object de-
tection have attracted the attention of both academia and in-
dustry due to their low cost and continuously improving per-
formance (Ma et al. 2024). The goal of 3D detection based
on images is to output the categories and positions of objects
in the input RGB images. As shown in Figure 1, the middle
image represents the information that needs to be obtained in
the 3D object detection task, which includes the center point
coordinate C(x, y, z) of the object, the object’s dimensions
of length (L), width (W), height (H), and the yaw angle θ of
the object in the Y direction. The image on the right shows
the representation of the object in the BEV perspective. The
BEV view can more conveniently serve the perception of
downstream tasks, such as planning.
Figure 1: 3D Detection Box Generation and BEV Perspective Overview Diagram. The detection box generation adopts the 7-parameter method. In the middle figure, L, W, H represent length, width, and height, respectively, C represents the coordinate (x, y, z) of the center point of the detection box, and θ represents the yaw angle.
Currently, image-based 3D object detection methods are mainly divided into transformer-based methods that directly predict 3D detection boxes and frustum-based methods that estimate the depth or height of targets from 2D features. The
main working principle of Transformer-based methods is
to establish the connection between 3D position features
and 2D image features through queries. According to dif-
ferent query objects, it can be divided into queries for the
target set (Wang et al. 2022; Liu et al. 2022; Chen et al.
2022) and queries for the BEV grid (Li et al. 2022; Yang
et al. 2023a; Jiang et al. 2023). The most representative
algorithms among them are DETR3D (Wang et al. 2022)
and BEVFormer (Li et al. 2022), respectively. Transformer-
based methods are generally applied in multi-view vehicle-mounted scenarios (Caesar et al. 2020) and transfer poorly to monocular roadside scenarios.
The frustum-based Lift-Splat-Shoot (LSS) method lifts the 2D image features into a frustum and then splats them onto the BEV grid (Philion and Fidler 2020). Many subsequent
methods have adopted the idea of LSS (Huang et al. 2021;
Li et al. 2023; Yang et al. 2023c,b). BEVDet uses the LSS
method as the view transformer to convert the image view
features to the BEV space, and by constructing a unique data
augmentation method and an improved non-maximum sup-
pression (NMS), it significantly improves the performance
of multi-camera 3D object detection (Huang et al. 2021).
Other methods adopting LSS use point cloud data as su-
pervision to estimate the depth and height. BEVDepth improves on the LSS framework by introducing explicit depth supervision and a camera-aware depth estimation module, and by designing a depth refinement module to enhance the accuracy of depth prediction, achieving the new best performance in the multi-view 3D object detection task (Li et al. 2023). However, from the roadside perspective, the optical axis of the camera is not parallel to the ground, which makes depth estimation challenging and reduces robustness.
Based on the features of the roadside monocular camera
dataset, BEVHeight improves the LSS method by predict-
ing the height of pixels relative to the ground instead of the
depth, addressing the shortcomings of the traditional LSS
method in depth prediction from the roadside perspective
and enhancing the detection performance and the model’s
robustness to variations in camera installation height (Yang
et al. 2023c). On this basis, BEVHeight++ combines depth
and height estimation through the cross-attention mech-
anism, improving the performance of BEVHeight in the
multi-view scenarios of vehicle-mounted cameras (Yang
et al. 2023b). However, when fusing the height feature and
the context feature, these methods adopt a pixel-by-pixel
fusion approach and do not achieve spatial alignment effi-
ciently. Therefore, our framework efficiently fuses the height
and context features through the DMSC module.
Spatial Attention and Vision Transformer
The attention mechanism has been widely applied to solve
the problem of key region identification in image under-
standing and analysis in computer vision, such as im-
age classification, object detection, semantic segmentation,
video understanding, image generation, 3D vision, multi-
modal tasks, and self-supervised learning (Guo et al. 2022).
The Vision Transformer (ViT) splits the image into multi-
ple patches and regards these patches as elements in a se-
quence. Then, it processes the sequence of patches using a
standard Transformer encoder, as in natural language processing (Dosovitskiy et al. 2020). After pre-training on large-
scale datasets, it demonstrates outstanding performance on
various image recognition tasks. Subsequently, Xu et al.
innovated the heterogeneous multi-agent self-attention and
multi-scale window self-attention modules based on ViT,
effectively improving the 3D object detection performance
of the algorithm in complex noisy environments (Xu et al.
2022). The development of the ViT framework has signifi-
cantly enhanced the performance of computer vision algo-
rithms in tasks such as image classification and 3D object
detection.
Besides ViT, spatial attention (Carion et al. 2020) and
temporal attention (Xu et al. 2017) have been widely applied
to computer vision tasks to explore global spatial and tem-
poral information and avoid limited perceptual fields from
convolution. VISTA enhances the 3D object detection per-
formance of the algorithm through the cross-view spatial
attention module, enabling the attention mechanism to fo-
cus on specific target areas (Deng et al. 2022). To address
the problems of slow convergence speed and limited spa-
tial resolution existing in the spatial attention mechanism,
Deformable-DETR enables the attention module to focus
only on a small number of key sampling points around
the reference points (Zhu et al. 2020). Considering that we need to align the height features with the context features, the deformable multi-scale spatial cross-attention module is adopted to fuse the height and context features efficiently and improve the algorithm's performance. Meanwhile, our framework enhances the efficiency of BEV feature extraction through ViT-style self-attention to improve the accuracy of 3D object detection.
Methodology
Figure 2 shows the overall framework of the proposed road-
side monocular 3D object detection algorithm. Initially, an
image of a fixed size is input, and its 2D features are ex-
tracted through the backbone network. Subsequently, the
height features of the image are obtained through the height
network (Yang et al. 2023c) and combined with the context
features and camera parameters to achieve the fused features
considering the camera parameters and the object height fea-
tures. The improved projection method in the 2D-to-3D projector is adopted to extract 3D features, and voxel pooling is then performed on them. The BEV features are obtained through the self-attention module, and the detection head ultimately yields the 3D object detection results.
Figure 2: Overview of Our Method Architecture. The top-left is the input image; the image backbone extracts 2D features from it. After the height features, context features, and camera parameters are fused, the projector lifts these features into 3D. BEV features are then obtained by voxel pooling and the self-attention module. Finally, the 3D object detection head outputs the detection results.
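To make the data flow concrete, below is a minimal PyTorch-style sketch of the pipeline in Figure 2. The stage modules passed to the constructor (backbone, height_net, dmsc_fusion, projector, voxel_pooling_former, det_head) are hypothetical placeholders standing in for the components named above, not the released implementation.

```python
# A minimal sketch of the detection pipeline described above (Figure 2).
# The stage modules are assumed to be provided elsewhere; names are illustrative.
import torch.nn as nn


class RoadsideDetector(nn.Module):
    def __init__(self, backbone, height_net, dmsc_fusion,
                 projector, voxel_pooling_former, det_head):
        super().__init__()
        self.backbone = backbone                          # 2D image feature extractor (e.g. ResNet-101)
        self.height_net = height_net                      # predicts per-pixel height features
        self.dmsc_fusion = dmsc_fusion                    # deformable multi-scale spatial cross-attention
        self.projector = projector                        # height-based 2D-to-3D frustum projection
        self.voxel_pooling_former = voxel_pooling_former  # voxel pooling + self-attention (VPF)
        self.det_head = det_head                          # 3D detection head on BEV features

    def forward(self, image, cam_intrinsics, cam2virt, virt2ego):
        feat_2d = self.backbone(image)                    # (B, C, H, W)
        height_pred = self.height_net(feat_2d)            # per-pixel height features
        fused = self.dmsc_fusion(feat_2d, height_pred)    # Eq. (1)
        voxels = self.projector(fused, height_pred,
                                cam_intrinsics, cam2virt, virt2ego)  # Eqs. (2)-(5)
        bev = self.voxel_pooling_former(voxels)           # Eq. (6)
        return self.det_head(bev)
```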
Fusion of Height Feature with Context
As shown in Figure 2, height information is a critical fea-
ture. The Height Net extracts the height information from
the 2D image features, and the output height feature map is
the same shape as the input image. Since the traditional LSS
perception method would lose the height information during
the voxel pooling process (Yang et al. 2023b), it is necessary
to fuse the height feature with the environmental feature be-
fore implementing the lift operation to retain the height in-
formation. Existing methods adopt the pixel-by-pixel stitch-
ing approach to fuse the height and background informa-
tion (Yang et al. 2023c). However, in roadside cameras, spa-
tial misalignment may occur due to lighting conditions and
viewpoint angles. Therefore, we adopted the DMSC mod-
ule to fuse the height and context features before “lifting”, as
shown in the following equation:
$F_{fused} = \mathrm{DMSC}(F_{context}, H_{pred})$    (1)
where $F_{fused}$ is the fused feature, $\mathrm{DMSC}(\cdot)$ is the deformable multi-scale spatial cross-attention module, $F_{context}$ denotes the context feature, and $H_{pred}$ is the height feature predicted by the height network.
Meanwhile, Figure 3 shows the detailed process of the fu-
sion of height features and context features. Based on the
2D features of the image, the height of the traffic target in
the image is estimated through the height network, and the
estimated height features and the context features are used
together as the input of the DMSC module. Through the
DMSC module, the 2D fused features that fuse the height
and background features are obtained.
Figure 3: Deformable Multi-scale Spatial Cross-attention
Fused Height Feature and Context Feature.
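For illustration, the following is a simplified, single-scale sketch of deformable cross-attention that aligns a height feature map with a context feature map, in the spirit of the DMSC module described above. Multi-scale handling, head splitting, and the exact design of the paper's module are omitted; the class name, parameters, and offset normalization are illustrative assumptions, not the released implementation.

```python
# A simplified, single-scale sketch of deformable cross-attention fusion:
# the context features act as queries that predict sampling offsets and
# attention weights; the height features are sampled at the offset locations.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DeformableCrossAttentionFusion(nn.Module):
    def __init__(self, channels: int, num_points: int = 4):
        super().__init__()
        self.num_points = num_points
        self.offset_pred = nn.Conv2d(channels, 2 * num_points, 1)  # (dx, dy) per sampling point
        self.weight_pred = nn.Conv2d(channels, num_points, 1)      # attention weight per point
        self.value_proj = nn.Conv2d(channels, channels, 1)
        self.out_proj = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, context: torch.Tensor, height: torch.Tensor) -> torch.Tensor:
        B, C, H, W = context.shape
        value = self.value_proj(height)
        offsets = self.offset_pred(context).view(B, self.num_points, 2, H, W)
        weights = self.weight_pred(context).softmax(dim=1)          # (B, P, H, W)

        # Reference grid in normalized [-1, 1] coordinates, as expected by grid_sample.
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, H, device=context.device),
            torch.linspace(-1, 1, W, device=context.device),
            indexing="ij",
        )
        ref = torch.stack((xs, ys), dim=-1)                          # (H, W, 2)

        sampled = 0.0
        for p in range(self.num_points):
            # Offsets are assumed to be predicted in normalized units for simplicity.
            grid = ref + offsets[:, p].permute(0, 2, 3, 1)           # (B, H, W, 2)
            samp = F.grid_sample(value, grid, align_corners=True)    # (B, C, H, W)
            sampled = sampled + weights[:, p:p + 1] * samp

        # Concatenate the aligned height features with the context and project back.
        return self.out_proj(torch.cat((context, sampled), dim=1))
```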
2D to 3D Projection Based on Frustum
We adopted the frustum projection method (Yang et al. 2023c), which is well suited to height estimation, to project the 2D features into 3D space. The specific projection process is shown in Figure 4. Since the 2D features incorporate the height information, any pixel point $P_{image.i}$ in the image can be represented by its pixel position $(u, v)$ and height $h_i$, that is, $P_{image.i} = (u, v, h_i)$. Firstly, a coordinate system with the virtual origin $O_{virtual}$ and a reference plane at depth 1 is constructed, and the pixel points are converted into points in the camera coordinate system according to the camera intrinsic matrix:
$P^{cam}_{ref.i} = K^{-1}[u, v, 1, h_i]^T$    (2)
where $P^{cam}_{ref.i}$ represents the $i$-th point on the reference plane in the camera coordinate system, and $K$ represents the camera intrinsic matrix.
Since the axes of the camera coordinate system are generally not perpendicular to the ground, the points need to be converted to a virtual coordinate system whose Y-axis is perpendicular to the ground:
$P^{virt}_{ref.i} = T^{virt}_{cam} P^{cam}_{ref.i}$    (3)
where $P^{virt}_{ref.i}$ is the $i$-th point on the reference plane in the virtual coordinate system, and $T^{virt}_{cam}$ is the transformation matrix from the camera coordinate system to the virtual coordinate system.
Suppose the distance from the origin of the virtual coordinate system to the ground is $H$. The ground-plane point $P^{virt}_{gd.i}$ in the virtual coordinate system corresponding to the estimated height $h_i$ is found through the principle of similar triangles:
$P^{virt}_{gd.i} = \dfrac{H - h_i}{y^{virt}_{ref.i}} P^{virt}_{ref.i}$    (4)
where $y^{virt}_{ref.i}$ represents the Y-axis coordinate of point $P^{virt}_{ref.i}$.
After this series of coordinate and plane transformations, the point $P^{virt}_{gd.i}$ on the ground plane in the virtual coordinate system is obtained. To facilitate the detection of its position, it is converted to the ego (vehicle) coordinate system:
$P^{ego}_{gd.i} = T^{ego}_{virt} P^{virt}_{gd.i}$    (5)
Figure 4: The Diagram Illustrating the Frustum Projection Using Height Estimation. The 2D detection features are fused within the view frustum and projected into 3D space through height estimation.
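The projection in Equations (2)-(5) can be illustrated with the following NumPy sketch for a single pixel. The calibration inputs (the intrinsic matrix K, the camera-to-virtual and virtual-to-ego transforms, assumed to be 4x4 homogeneous matrices, and the camera height H_cam) are placeholders supplied by calibration, and the handling of the height channel follows our reading of the equations above rather than the released code.

```python
# A numeric sketch of the height-based projection in Eqs. (2)-(5): a pixel
# (u, v) with predicted height h_i is lifted onto the ground plane in the ego frame.
import numpy as np


def pixel_to_ego(u, v, h_i, K, T_virt_cam, T_ego_virt, H_cam):
    # Eq. (2): back-project the pixel onto a reference plane at depth 1 in the
    # camera frame; the predicted height h_i is carried along separately.
    p_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])

    # Eq. (3): rotate into the virtual frame whose Y-axis is perpendicular to the ground.
    p_virt = T_virt_cam[:3, :3] @ p_cam

    # Eq. (4): scale by similar triangles so the point lands at height h_i above
    # the ground, with H_cam the height of the virtual origin over the ground.
    p_ground = (H_cam - h_i) / p_virt[1] * p_virt

    # Eq. (5): transform the ground-plane point into the ego coordinate system.
    p_ego = T_ego_virt[:3, :3] @ p_ground + T_ego_virt[:3, 3]
    return p_ego
```

In the model, this lifting is applied densely to the fused feature map, and the resulting 3D features are then passed to voxel pooling.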
BEV Feature Extraction
Compared with the 3D feature space, the BEV feature space
can better capture the positions of traffic objects, thereby
generating 3D detection boxes. Meanwhile, pooling the
voxel features into BEV features reduces the computational
load and is more adaptable to the computing characteristics
of roadside perception devices. In order to improve the ex-
traction efficiency of BEV features, we processed the pool-
ing feature map using the self-attention module to reduce the
noise interference from voxel pooling. The operation is as follows:
$F_{BEV} = \mathrm{VPF}(F_{pooling})$    (6)
where $F_{BEV}$ is the resulting BEV feature, $\mathrm{VPF}(\cdot)$ is the Voxel Pooling Former (self-attention) module, and $F_{pooling}$ is the feature map after voxel pooling.
The details of the method are shown in Figure 5. For the feature map, an appropriate patch size is set and the features are divided into multiple patches; the smaller the patch size, the finer the division and the more feature details are preserved. After the division, the features are processed by the self-attention heads, each of which independently calculates the weights between different patches, and are then further refined through the multi-layer perceptron (MLP) module. Finally, the features are aggregated to obtain the BEV feature map.
Figure 5: Schematic Diagram of Obtaining BEV Features by Self-Attention. The input on the left is the feature map after voxel pooling. The feature map is divided into fixed-size patches, and the BEV feature map is obtained through the multi-head attention mechanism and the MLP module.
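As a concrete illustration, here is a compact sketch of ViT-style self-attention applied to the pooled BEV feature map, following the patch-splitting procedure described above. The patch size, head count, and single-block depth are illustrative assumptions rather than the paper's settings.

```python
# A compact sketch of ViT-style self-attention over the pooled BEV feature map:
# the map is split into fixed-size patches, processed by multi-head attention
# and an MLP with residual connections, then folded back into BEV layout.
import torch
import torch.nn as nn


class VoxelPoolingFormerSketch(nn.Module):
    def __init__(self, channels: int, patch: int = 4, heads: int = 8):
        super().__init__()
        dim = channels * patch * patch            # token dimension; must be divisible by heads
        self.patch = patch
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 2 * dim), nn.GELU(), nn.Linear(2 * dim, dim))

    def forward(self, bev: torch.Tensor) -> torch.Tensor:      # (B, C, H, W)
        B, C, H, W = bev.shape
        p = self.patch
        assert H % p == 0 and W % p == 0, "BEV grid must be divisible by the patch size"
        # Split the BEV map into non-overlapping p x p patches -> token sequence.
        tokens = bev.unfold(2, p, p).unfold(3, p, p)            # (B, C, H/p, W/p, p, p)
        tokens = tokens.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)
        # Standard pre-norm attention + MLP block with residual connections.
        x = tokens
        a, _ = self.attn(self.norm1(x), self.norm1(x), self.norm1(x))
        x = x + a
        x = x + self.mlp(self.norm2(x))
        # Fold the tokens back into the (B, C, H, W) BEV layout.
        x = x.view(B, H // p, W // p, C, p, p).permute(0, 3, 1, 4, 2, 5)
        return x.reshape(B, C, H, W)
```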
Experiments
Dataset
Rope3D. The data adopted in this experiment is the road-
side camera dataset Rope3D, which was launched by the
Institute for AI Industry Research (AIR), Tsinghua Univer-
sity in 2022 (Ye et al. 2022). This dataset collected about
50,000 images at 26 intersections under different time peri-
ods, weather conditions, densities, and distributions of traf-
fic participants through roadside cameras. It has rich and di-
verse image data. Meanwhile, the cameras are installed at different positions and pitch angles and have different focal lengths; these differences raise the requirements for the robustness of the detection algorithm. It is worth mentioning that this dataset
acquired point cloud data through vehicles equipped with
LiDAR to calibrate the 3D information of traffic targets in
the image data in the same scenarios as the camera collec-
tion. As a result, Rope3D has a relatively high annotation quality.
This dataset annotates about 1,500,000 3D traffic objects.
The marked traffic participants in this dataset include four
major categories: Car, Big-vehicle, Pedestrian, and Cyclist.
These are further subdivided into 13 minor categories; for example, Big-vehicle includes Truck and Bus. The label distribu-
tion of various categories is relatively uniform, reducing the
influence of the long-tail effect on the detection tasks and
algorithms. As a popular roadside perception benchmark, the dataset proposes AP3D|R40 and Ropescore as evaluation indicators so that the performance of various algorithms can be compared uniformly. At present, some
researchers have tested 3D object detection algorithms in
the monocular roadside view based on this dataset, such
as YOLOv7-3D (Ye et al. 2023), BEVHeight (Yang et al.
2023c), BEVSpread (Wang et al. 2024).
DAIR-V2X-I. In order to promote the computer vision
research and innovation of Vehicle-Infrastructure Coopera-
tive Autonomous Driving (VICAD), Yu et al. released the
DAIR-V2X dataset (Yu et al. 2022). This is the first large-
scale, multimodal, and multi-view real vehicle scene dataset,
containing 71,254 LiDAR (Light Detection and Ranging) frames and the same number of camera frames.
All frames are from real scenes and come with 3D annota-
tions. In this paper, we mainly use DAIR-V2X-I, the roadside-view subset. This dataset contains
10,000 images, which are divided into training set, valida-
tion set and test set in the ratio of 5:2:3. Since the test set
data has not been released yet, we use the validation set to
verify the performance of the model. The main verification
metric adopted is average precision (AP).
Experiment Setting
To improve the robustness of the 3D object detection algo-
rithm and enhance the credibility of the roadside percep-
tion system, we conducted extensive experiments based on
the Rope3D and DAIR-V2X-I datasets. For the architectural
details of the experiment, we adopt ResNet-101 (He et al.
2016) as the backbone network for image feature extrac-
tion and set the resolution of the input image as (864, 1536).
Meanwhile, in order to increase the robustness of the algo-
rithm, we adopted the method of data augmentation, that is,
scaling and rotating the original image data. The model is trained and validated on six NVIDIA A800 GPUs, with a batch size of 2 per GPU, and we employ the AdamW op-
timizer (Kingma and Ba 2014) with an initial learning rate
of 2e−4. In order to compare with other algorithms, the
number of training epochs is 150 (Yang et al. 2023b).
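For reference, the training setup described above can be summarized in a simple configuration sketch; the field names are illustrative and do not mirror the released code.

```python
# A hedged summary of the training setup described in this section.
train_config = {
    "backbone": "ResNet-101",
    "input_resolution": (864, 1536),          # (height, width)
    "data_augmentation": ["random_scale", "random_rotation"],
    "gpus": 6,                                 # NVIDIA A800
    "batch_size_per_gpu": 2,
    "optimizer": "AdamW",
    "learning_rate": 2e-4,
    "epochs": 150,
    "datasets": ["Rope3D", "DAIR-V2X-I"],
}
```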
Metrics
In order to compare with other state-of-the-art (SOTA) algorithms, we adopt the metric AP|R40 (Simonelli et al. 2019) used in the Rope3D dataset and the AP used in the DAIR-V2X-I dataset as the main evaluation metrics for the accuracy of the model. The metric is calculated as follows:
$AP|_R = \dfrac{1}{|R|} \sum_{r \in R} \max_{\tilde{r}: \tilde{r} \ge r} \rho(\tilde{r})$    (7)
where $\rho(\tilde{r})$ represents the precision at recall $\tilde{r}$, and the recall thresholds are $r \in R = \{1/40, 2/40, \ldots, 1\}$.
Meanwhile, an indicator that is more suitable for the Rope3D dataset is also adopted (Ye et al. 2022). This indicator combines the Average Ground Center Similarity (ACS), Average Orientation Similarity (AOS), Average Four Ground Points Distance (AGD), and Average Four Ground Points Similarity (AGS) into $S = (ACS + AOS + AGD + AGS)/4$ and weights it against AP, as shown in the following equation:
$Rope_{score} = (\omega_1 \cdot AP + \omega_2 \cdot S) / (\omega_1 + \omega_2)$    (8)
where $AP$ is the value calculated by Equation 7, and the weights are set to $\omega_1 = 8$ and $\omega_2 = 2$.
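As a concrete illustration, the two metrics can be computed as in the sketch below. `precision_at_recall` is an assumed helper returning the detector's precision at a given recall level, and the dense recall grid is an implementation convenience, not part of the metric's definition.

```python
# A small numeric sketch of the metrics in Eqs. (7) and (8).
import numpy as np


def ap_r40(precision_at_recall) -> float:
    """AP|R40: mean of the max precision at or beyond each of 40 recall thresholds."""
    recalls = np.arange(1, 41) / 40.0                 # {1/40, 2/40, ..., 1}
    fine = np.linspace(0.0, 1.0, 401)                 # dense grid for the max over recall >= r
    prec = np.array([precision_at_recall(r) for r in fine])
    ap = 0.0
    for r in recalls:
        ap += prec[fine >= r].max()
    return ap / len(recalls)


def rope_score(ap: float, acs: float, aos: float, agd: float, ags: float,
               w1: float = 8.0, w2: float = 2.0) -> float:
    """Rope_score as in Eq. (8): weighted combination of AP and the similarity average S."""
    s = (acs + aos + agd + ags) / 4.0
    return (w1 * ap + w2 * s) / (w1 + w2)
```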
Comparing with State-of-the-art
Results on the Rope3D. We mainly compared 3D object
detection methods that adopt the BEV method, including
BEVFormer (Li et al. 2022), BEVDepth (Li et al. 2023),
BEVHeight (Yang et al. 2023c), and BEVHeight++ (Yang
et al. 2023b). Among them, BEVHeight mainly focuses on
roadside monocular detection, BEVFormer and BEVDepth
focus on multi-view 3D object detection, and BEVHeight++
integrates height and depth information and also has a good
performance in multi-view scenarios. Table 1 shows the de-
tection situations of various IoU for Car and Big-vehicle un-
der the Rope3D test dataset. The results show that compared
with the SOTA algorithm BEVHeight++, our algorithm in-
creases the detection accuracy of Car and Big-vehicle by
2.37% and 10.58%, respectively, when IoU is 0.5, and in-
creases the detection accuracy of Big-vehicles by 1.40%
when IoU is 0.7. However, the detection result for Cars at IoU = 0.7 lags behind the BEVHeight++ algorithm by 4.84%.
To test the robustness of the algorithm, we divided the dif-
ficulty level of object detection into three grades: Easy, Mid,
and Hard, according to the distance between the object and
the camera, occlusion, and truncation. Easy corresponds to
objects without occlusion and no occlusion segments, Mid
corresponds to targets with lateral occlusion at 0% - 50%,
and Hard corresponds to targets with longitudinal occlusion
at 50% - 100%. Then, the detection accuracy of the two traffic objects, Vehicle and Cyclist, was compared. Consid-
ering that the objects of Cyclist are relatively small and the
detection is more difficult, referring to the existing meth-
ods (Yang et al. 2023c,b), the IoU was set to 0.25 in this
paper. Table 2 indicates that for the two detection targets
of Vehicle and Cyclist under different difficulty conditions,
our algorithm has improved by 8.71%/9.41%/9.23% and
6.70%/5.28%/4.71% respectively compared to the SOTA al-
gorithm BEVHeight++. According to the performance de-
cline under different difficulties, our algorithm shows better
robustness.
Figure 6 shows the detection results of traffic targets. This
indicates that the algorithm has a higher detection accuracy
for larger traffic targets located in the center of the image.
However, for smaller traffic targets on both sides of the im-
age, the effect is not ideal.
Figure 6: Visualized Detection Results. In the upper chart,
the camera icon indicates the direction in which the camera
is located. Blue represents the labeled detection box, green
indicates correct detection, and red indicates incorrect de-
tection, with no marking for undetected items. In the lower
chart, the green detection box represents a small car, the yel-
low detection box represents a big-vehicle and the red detec-
tion box represents a cyclist.
Results on the DAIR-V2X-I. For the validation re-
sults on the DAIR-V2X-I dataset, we compared multi-
ple SOTA algorithms such as PointPillars (Lang et al.
2019), SECOND (Yan, Mao, and Li 2018), MVXNet
(Sindagi, Zhou, and Tuzel 2019), ImvoxelNet (Rukhovich,
Vorontsova, and Konushin 2022), BEVFormer (Li et al.
2022), BEVDepth (Li et al. 2023), BEVHeight (Yang et al.
2023c), BEVHeight++ (Yang et al. 2023b), etc. As can
be seen from Table 3, our algorithm is comprehensively
superior to the existing algorithms in the 3D object de-
tection tasks of vehicles and cyclists, and is superior to
other vision-based algorithms in the hard mode of pedes-
trian detection. Compared with the height-estimation-based
benchmark algorithm BEVHeight, our algorithm improves
by 1.65%/3.44%/3.37% in the Vehicle detection task, by
1.49%/1.55%/1.67% in the pedestrian detection task, and by
0.56%/0.57%/0.59% in the cyclist detection task.
Method
IoU=0.5 IoU=0.7
Car Big-vehicle Car Big-vehicle
AP3D  Rope    AP3D  Rope    AP3D  Rope    AP3D  Rope
BEVFormer 50.62 58.78 34.58 45.16 24.64 38.71 10.08 26.16
BEVDepth 69.63 74.70 45.02 54.64 42.56 53.05 21.47 35.82
BEVHeight 74.60 78.72 48.93 57.70 45.73 55.62 23.07 37.04
BEVHeight++ 76.12 80.91 50.11 59.92 47.03 57.77 24.43 39.57
Ours 78.49 81.72 60.69 67.15 42.19 52.70 25.83 39.33
AP3D and Rope represent AP3D|R40 and Ropescore, respectively.
Table 1: Vision-based 3D Detection Performance of Car and Big-vehicle on the Rope3D Val set.
Method Veh. (IoU=0.5) Cyc. (IoU=0.25)
Easy Mid Hard Easy Mid Hard
BEVFormer 61.37 50.73 50.73 22.16 22.13 22.00
BEVDepth 71.56 60.75 60.85 40.83 40.66 40.26
BEVHeight 75.58 63.49 63.59 47.97 47.45 48.12
BEVHeight++ 76.98 65.52 65.64 48.14 48.11 48.63
Ours 85.69 74.93 74.87 54.84 53.39 53.34
Veh., Cyc. represent Vehicle and Cyclist respectively.
Easy, Mid and Hard indicate the difficulty setting levels of detection.
Table 2: Vision-based 3D detection performance of Vehicle and Cyclist on the Rope3D val set.
Method  M  Veh. (IoU=0.5)  Ped. (IoU=0.25)  Cyc. (IoU=0.25)
Easy Mid Hard Easy Mid Hard Easy Mid Hard
PointPillars L 63.07 54.00 54.01 38.53 37.20 37.28 38.46 22.60 22.49
SECOND L 71.47 53.99 54.00 55.16 52.49 52.52 54.68 31.05 31.19
MVXNet LC 71.04 53.71 53.76 55.83 54.45 54.40 54.05 30.79 31.06
ImvoxelNet C 44.78 37.58 37.55 6.81 6.746 6.73 21.06 13.57 13.17
BEVFormer C 61.37 50.73 50.73 16.89 15.82 15.95 22.16 22.13 22.06
BEVDepth C 75.50 63.58 63.67 34.95 33.42 33.27 55.67 55.47 55.34
BEVHeight C 77.78 65.77 65.85 41.22 39.29 39.46 60.23 60.08 60.54
BEVHeight++ C 79.31 68.62 68.68 42.87 40.88 41.06 60.76 60.52 61.01
Ours C 79.43 69.21 69.22 42.79 40.84 41.13 60.79 60.65 61.13
M, L, C denotes modality, LiDAR, camera respectively.
Table 3: Comparing with the state-of-the-art on the DAIR-
V2X-I val set.
Discussion
Aiming at the problems existing in roadside perception, we
added the Spatial Former and Voxel Pooling Former mod-
ules on the basis of the existing height estimation method
framework to improve the spatial alignment ability of height
features and context features and the extraction efficiency of
pooling features. To verify the effectiveness of our method,
extensive experiments have been conducted based on the
public dataset Rope3D (Ye et al. 2022) and DAIR-V2X-I
(Yu et al. 2022). The experimental results show that the road-
side monocular 3D object detection framework we proposed
has outperformed the currently popular algorithms in the de-
tection effect of vehicles and cyclists. Meanwhile, the results
of different detection difficulties indicate that our algorithm
is robust and can still detect traffic targets well as the detec-
tion difficulty increases. Currently, many countries around
the world have installed roadside cameras. Applying these
devices to the development of autonomous driving technol-
ogy will provide a safer and more reliable autonomous driv-
ing environment and promote the development of intelligent
transportation (Zou et al. 2022; Creß, Bing, and Knoll
2024). Our algorithm can provide technical support for the
large-scale application and implementation of autonomous
driving and promote the development of an intelligent trans-
portation system of vehicle-road coordination.
Although the algorithm we proposed has achieved good
results on vehicles and cyclists, the following deficiencies remain and will be addressed in subsequent research. Firstly, since pedestrians usually appear on
both sides of the road and it is difficult for roadside cameras
to clearly capture pedestrian information, we did not detect
pedestrian targets. Subsequently, the model framework will
be enhanced based on the characteristics of pedestrian tar-
gets from the roadside perspective to achieve 3D detection
of pedestrians at intersections. Finally, due to limited com-
puting resources, we did not conduct ablation experiments
to fully demonstrate the rationality of the model structure. In
the subsequent research, a large number of ablation experi-
ments will be conducted to compare the effects of different
modules.
Conclusion
In the field of 3D object detection for autonomous driv-
ing, roadside monocular detection algorithms suffer from problems such as low perception accuracy and weak robustness caused by varying equipment specifications, installation angles, and the non-parallelism between the camera optical axis and the ground. We proposed a roadside monoc-
ular 3D object detection framework based on Spatial For-
mer and Voxel Pooling Former, which improves the perfor-
mance and robustness of the algorithm based on the height
estimation method. Through extensive experiments on the
popular Rope3D and DAIR-V2X-I benchmark, the results
show that for 3D detection of Big-vehicles at IoU = 0.5, our algorithm increases the accuracy from 50.11% to 60.69% compared to the SOTA algorithm BEVHeight++. In
the detection of vehicles and cyclists under different detec-
tion difficulties, it has improved by 8.71%/9.41%/9.23% and
6.70%/5.28%/4.71%, respectively. The experimental results
of DAIR-V2X-I show that the algorithm proposed in this pa-
per improves by 1.65%/3.44%/3.37%, 1.49%/1.55%/1.67%,
and 0.56%/0.57%/0.59% respectively in the detection of
vehicles, pedestrians, and cyclists compared to the height-
estimation-based benchmark algorithm BEVHeight. Exper-
imental results show that our algorithm has good detection
performance for vehicles and cyclists. At the same time, as
the detection difficulty increases, the robustness of the al-
gorithm is good. Our algorithm improves the performance
of the roadside monocular 3D object detection method, en-
hances the safety and reliability of autonomous driving per-
ception, helps build an intelligent transportation system of
vehicle-road coordination, and promotes the application of
autonomous driving technology.
References
Caesar, H.; Bankiti, V.; Lang, A. H.; Vora, S.; Liong, V. E.;
Xu, Q.; Krishnan, A.; Pan, Y.; Baldan, G.; and Beijbom, O.
2020. nuscenes: A multimodal dataset for autonomous driv-
ing. In Proceedings of the IEEE/CVF conference on com-
puter vision and pattern recognition, 11621–11631.
Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov,
A.; and Zagoruyko, S. 2020. End-to-end object detection
with transformers. In European conference on computer vi-
sion, 213–229. Springer.
Chen, S.; Wang, X.; Cheng, T.; Zhang, Q.; Huang,
C.; and Liu, W. 2022. Polar parametrization for
vision-based surround-view 3d detection. arXiv preprint
arXiv:2206.10965.
Creß, C.; Bing, Z.; and Knoll, A. C. 2024. Intelligent Trans-
portation Systems Using Roadside Infrastructure: A Litera-
ture Survey. 25(7): 6309–6327.
Deng, S.; Liang, Z.; Sun, L.; and Jia, K. 2022. Vista: Boost-
ing 3d object detection via dual cross-view spatial attention.
In Proceedings of the IEEE/CVF conference on computer
vision and pattern recognition, 8448–8457.
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn,
D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.;
Heigold, G.; Gelly, S.; et al. 2020. An image is worth 16x16
words: Transformers for image recognition at scale. arXiv
preprint arXiv:2010.11929.
Guo, M.-H.; Xu, T.-X.; Liu, J.-J.; Liu, Z.-N.; Jiang, P.-T.;
Mu, T.-J.; Zhang, S.-H.; Martin, R. R.; Cheng, M.-M.; and
Hu, S.-M. 2022. Attention mechanisms in computer vision:
A survey. Computational visual media, 8(3): 331–368.
Hao, R.; Fan, S.; Dai, Y.; Zhang, Z.; Li, C.; Wang, Y.; Yu,
H.; Yang, W.; Yuan, J.; and Nie, Z. 2024. RCooper: A Real-
world Large-scale Dataset for Roadside Cooperative Per-
ception. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition (CVPR), 22347–
22357.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep Residual
Learning for Image Recognition. In 2016 IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), 770–
778. Las Vegas, NV, USA: IEEE. ISBN 978-1-4673-8851-1.
Huang, J.; Huang, G.; Zhu, Z.; Ye, Y.; and Du, D. 2021.
Bevdet: High-performance multi-camera 3d object detection
in bird-eye-view. arXiv preprint arXiv:2112.11790.
Jiang, Y.; Zhang, L.; Miao, Z.; Zhu, X.; Gao, J.; Hu, W.;
and Jiang, Y.-G. 2023. Polarformer: Multi-camera 3d object
detection with polar transformer. In Proceedings of the AAAI
conference on Artificial Intelligence, 1042–1050.
Jinrang, J.; Li, Z.; and Shi, Y. 2023. MonoUNI: A Unified
Vehicle and Infrastructure-side Monocular 3D Object De-
tection Network with Sufficient Depth Clues. In Oh, A.;
Naumann, T.; Globerson, A.; Saenko, K.; Hardt, M.; and
Levine, S., eds., Advances in Neural Information Process-
ing Systems, volume 36, 11703–11715. Curran Associates,
Inc.
Kingma, D. P.; and Ba, J. 2014. Adam: A method for
stochastic optimization. arXiv preprint arXiv:1412.6980.
Lang, A. H.; Vora, S.; Caesar, H.; Zhou, L.; Yang, J.; and
Beijbom, O. 2019. Pointpillars: Fast encoders for object de-
tection from point clouds. In Proceedings of the IEEE/CVF
conference on computer vision and pattern recognition,
12697–12705.
Li, Y.; Ge, Z.; Yu, G.; Yang, J.; Wang, Z.; Shi, Y.; Sun, J.;
and Li, Z. 2023. Bevdepth: Acquisition of reliable depth for
multi-view 3d object detection. In Proceedings of the AAAI
Conference on Artificial Intelligence, 1477–1485.
Li, Z.; Wang, W.; Li, H.; Xie, E.; Sima, C.; Lu, T.; Qiao,
Y.; and Dai, J. 2022. BEVFormer: Learning Bird’s-Eye-
View Representation from Multi-camera Images via Spa-
tiotemporal Transformers. In Avidan, S.; Brostow, G.; Ciss´
e,
M.; Farinella, G. M.; and Hassner, T., eds., Computer Vision
– ECCV 2022, 1–18. Cham: Springer Nature Switzerland.
ISBN 978-3-031-20077-9.
Liu, Y.; Wang, T.; Zhang, X.; and Sun, J. 2022. Petr: Posi-
tion embedding transformation for multi-view 3d object de-
tection. In European Conference on Computer Vision, 531–
548. Springer.
Ma, X.; Ouyang, W.; Simonelli, A.; and Ricci, E. 2024. 3D
Object Detection From Images for Autonomous Driving: A
Survey. IEEE Transactions on Pattern Analysis and Ma-
chine Intelligence, 46(5): 3537–3556.
Muhammad, K.; Hussain, T.; Ullah, H.; Ser, J. D.; Rezaei,
M.; Kumar, N.; Hijji, M.; Bellavista, P.; and De Albu-
querque, V. H. C. 2022. Vision-Based Semantic Segmen-
tation in Scene Understanding for Autonomous Driving:
Recent Achievements, Challenges, and Outlooks. IEEE
Transactions on Intelligent Transportation Systems, 23(12):
22694–22715.
Philion, J.; and Fidler, S. 2020. Lift, splat, shoot: Encoding
images from arbitrary camera rigs by implicitly unproject-
ing to 3d. In Computer Vision–ECCV 2020: 16th European
Conference, Glasgow, UK, August 23–28, 2020, Proceed-
ings, Part XIV 16, 194–210. Springer.
Qian, R.; Lai, X.; and Li, X. 2022. 3D Object Detection for
Autonomous Driving: A Survey. Pattern Recognition, 130:
108796.
Rukhovich, D.; Vorontsova, A.; and Konushin, A. 2022.
Imvoxelnet: Image to voxels projection for monocular and
multi-view general-purpose 3d object detection. In Proceed-
ings of the IEEE/CVF Winter Conference on Applications of
Computer Vision, 2397–2406.
Simonelli, A.; Bulo, S. R.; Porzi, L.; López-Antequera, M.;
and Kontschieder, P. 2019. Disentangling monocular 3d ob-
ject detection. In Proceedings of the IEEE/CVF Interna-
tional Conference on Computer Vision, 1991–1999.
Sindagi, V. A.; Zhou, Y.; and Tuzel, O. 2019. Mvx-net:
Multimodal voxelnet for 3d object detection. In 2019 Inter-
national Conference on Robotics and Automation (ICRA),
7276–7282. IEEE.
Van Brummelen, J.; O’Brien, M.; Gruyer, D.; and Najjaran,
H. 2018. Autonomous vehicle perception: The technology
of today and tomorrow. Transportation Research Part C:
Emerging Technologies, 89: 384–406.
Wang, W.; Lu, Y.; Zheng, G.; Zhan, S.; Ye, X.; Tan,
Z.; Wang, J.; Wang, G.; and Li, X. 2024. BEVSpread:
Spread Voxel Pooling for Bird’s-Eye-View Representation
in Vision-based Roadside 3D Object Detection. In Pro-
ceedings of the IEEE/CVF Conference on Computer Vi-
sion and Pattern Recognition (CVPR), 14718–14727. Seattle
WA, USA: IEEE.
Wang, Y.; Guizilini, V. C.; Zhang, T.; Wang, Y.; Zhao, H.;
and Solomon, J. 2022. Detr3d: 3d object detection from
multi-view images via 3d-to-2d queries. In Conference on
Robot Learning, 180–191. PMLR.
Xu, R.; Xiang, H.; Tu, Z.; Xia, X.; Yang, M.-H.; and Ma,
J. 2022. V2x-vit: Vehicle-to-everything cooperative percep-
tion with vision transformer. In European conference on
computer vision, 107–124. Springer.
Xu, S.; Cheng, Y.; Gu, K.; Yang, Y.; Chang, S.; and Zhou, P.
2017. Jointly attentive spatial-temporal pooling networks for
video-based person re-identification. In Proceedings of the
IEEE international conference on computer vision, 4733–
4742.
Yan, Y.; Mao, Y.; and Li, B. 2018. Second: Sparsely embed-
ded convolutional detection. Sensors, 18(10): 3337.
Yang, C.; Chen, Y.; Tian, H.; Tao, C.; Zhu, X.; Zhang, Z.;
Huang, G.; Li, H.; Qiao, Y.; Lu, L.; et al. 2023a. Bevformer
v2: Adapting modern image backbones to bird’s-eye-view
recognition via perspective supervision. In Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, 17830–17839.
Yang, L.; Tang, T.; Li, J.; Chen, P.; Yuan, K.; Wang, L.;
Huang, Y.; Zhang, X.; and Yu, K. 2023b. BEVHeight++:
Toward Robust Visual Centric 3D Object Detection.
ArXiv:2309.16179 [cs].
Yang, L.; Yu, K.; Tang, T.; Li, J.; Yuan, K.; Wang, L.; Zhang,
X.; and Chen, P. 2023c. BEVHeight: A Robust Frame-
work for Vision-based Roadside 3D Object Detection. In
2023 IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition (CVPR), 21611–21620. Vancouver, BC,
Canada: IEEE. ISBN 9798350301298.
Yang, L.; Zhang, X.; Yu, J.; Li, J.; Zhao, T.; Wang,
L.; Huang, Y.; Zhang, C.; Wang, H.; and Li, Y. 2024.
MonoGAE: Roadside Monocular 3D Object Detection With
Ground-Aware Embeddings. IEEE Transactions on Intelli-
gent Transportation Systems, 1–15.
Ye, X.; Shu, M.; Li, H.; Shi, Y.; Li, Y.; Wang, G.; Tan, X.;
and Ding, E. 2022. Rope3d: The roadside perception dataset
for autonomous driving and monocular 3d object detection
task. In Proceedings of the IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition, 21341–21350.
Ye, Z.; Zhang, H.; Gu, J.; and Li, X. 2023. YOLOv7-3D:
A Monocular 3D Traffic Object Detection Method from a
Roadside Perspective. Applied Sciences, 13(20): 11402.
Yu, H.; Luo, Y.; Shu, M.; Huo, Y.; Yang, Z.; Shi, Y.; Guo, Z.;
Li, H.; Hu, X.; Yuan, J.; and Nie, Z. 2022. DAIR-V2X: A
Large-Scale Dataset for Vehicle-Infrastructure Cooperative
3D Object Detection. In 2022 IEEE/CVF Conference on
Computer Vision and Pattern Recognition (CVPR), 21329–
21338. IEEE. ISBN 978-1-66546-946-3.
Zou, Z.; Zhang, R.; Shen, S.; Pandey, G.; Chakravarty, P.; Parchami, A.; and Liu, H. X. 2022. Real-time Full-stack Traffic
Scene Perception for Autonomous Driving with Roadside
Cameras. In 2022 International Conference on Robotics and
Automation (ICRA), 890–896.
Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; and Dai, J. 2020.
Deformable detr: Deformable transformers for end-to-end
object detection. arXiv preprint arXiv:2010.04159.