Categorical Depth Distribution Network for Monocular 3D Object Detection
Cody Reading Ali Harakeh Julia Chae
Steven L. Waslander
University of Toronto
{cody.reading, nayoung.chae}@mail.utoronto.ca, ali.harakeh@utoronto.ca, stevenw@utias.utoronto.ca
Abstract
Monocular 3D object detection is a key problem for
autonomous vehicles, as it provides a solution with sim-
ple configuration compared to typical multi-sensor systems.
The main challenge in monocular 3D detection lies in accu-
rately predicting object depth, which must be inferred from
object and scene cues due to the lack of direct range mea-
surement. Many methods attempt to directly estimate depth
to assist in 3D detection, but show limited performance
as a result of depth inaccuracy. Our proposed solution,
Categorical Depth Distribution Network (CaDDN), uses a
predicted categorical depth distribution for each pixel to
project rich contextual feature information to the appropri-
ate depth interval in 3D space. We then use the computa-
tionally efficient bird’s-eye-view projection and single-stage
detector to produce the final output detections. We design
CaDDN as a fully differentiable end-to-end approach for
joint depth estimation and object detection. We validate
our approach on the KITTI 3D object detection benchmark,
where we rank 1st among published monocular methods. We
also provide the first monocular 3D detection results on the
newly released Waymo Open Dataset. The source code for
CaDDN will be made publicly available before publication.
1. Introduction
Perception in 3D space is a key component in fields such
as autonomous vehicles and robotics, enabling systems to
understand their environment and react accordingly. Li-
DAR [22,68,52,53,31] and stereo [48,47,29,12,58]
sensors have a long history of use for 3D perception
tasks, showing excellent results on 3D object detection
benchmarks such as the KITTI 3D object detection bench-
mark [17] due to their ability to generate precise 3D mea-
surements.
Monocular based 3D perception has been pursued simul-
taneously, motivated by the potential for a low-cost, easy-
to-deploy solution with a single camera [10,42,5,23]. Per-
formance on the same 3D object detection benchmarks lags
significantly relative to LiDAR and stereo methods, due to the loss of depth information when scene information is projected onto the image plane.
Figure 1. (a) Input image. (b) Without depth distribution supervision, BEV features from CaDDN suffer from smearing effects. (c) Depth distribution supervision encourages BEV features from CaDDN to encode meaningful depth confidence, in which objects can be accurately detected.
To combat this effect, monocular object detection meth-
ods [14,38,39,62] often learn depth explicitly, by train-
ing a monocular depth estimation network in a separate
stage. However, depth estimates are consumed directly in
the 3D object detection stage without an understanding of
depth confidence, leading to networks that tend to be over-
confident in depth predictions. Over-confidence in depth is
particularly an issue at long range [62], leading to poor lo-
calization. Further, depth estimation is separated from 3D
detection during the training phase, preventing depth map
estimates from being optimized for the detection task.
Depth information in image data can also be learned im-
plicitly, by directly transforming features from images to
3D space and finally to bird’s-eye-view (BEV) grids [50,
46]. Implicit methods, however, tend to suffer from fea-
ture smearing, wherein similar image features can exist at
multiple locations in the projected space. Feature smearing
increases the difficulty of localizing objects in the scene.
To resolve the identified issues, we propose a monocular
3D object detection method, CaDDN, that enables accurate
3D detection by learning categorical depth distributions. By
leveraging probabilistic depth estimation, CaDDN is able
to generate high quality bird’s-eye-view feature representa-
tions from images in an end-to-end fashion. We summarize
our approach with three contributions.
(1) Categorical Depth Distributions. In order to perform
3D detection, we predict pixel-wise categorical depth distri-
butions to accurately locate image information in 3D space.
Each predicted distribution describes the probabilities that
a pixel belongs to a set of predefined depth bins. We en-
courage our distributions to be as sharp as possible around
the correct depth bins, in order to encourage our network
to focus more on image information where depth estima-
tion is both accurate and confident [24]. By doing so, our
network is able to produce sharper and more accurate fea-
tures that are useful for 3D detection (see Figure 1). On
the other hand, our network retains the ability to produce
less sharp distributions when depth estimation confidence is
low. Using categorical distributions allows our feature en-
coding to capture the inherent depth estimation uncertainty
to reduce the impact of erroneous depth estimates, a prop-
erty shown to be key to CaDDN’s improved performance in
Section 4.3. Sharpness in our predicted depth distributions
is encouraged through supervision with one-hot encodings
of the correct depth bin, which can be generated by project-
ing LiDAR depth data into the camera frame.
(2) End-To-End Depth Reasoning. We learn depth dis-
tributions in an end-to-end fashion, jointly optimizing for
accurate depth prediction as well as accurate 3D object de-
tection. We argue that joint depth estimation and 3D detec-
tion reasoning encourages depth estimates to be optimized
for the 3D detection task, leading to increased performance
as shown in Section 4.3.
(3) BEV Scene Representation. We introduce a novel
method to generate high quality bird’s-eye-view scene rep-
resentations from single images using categorical depth dis-
tributions and projective geometry. We select the bird’s-
eye-view representation due to its ability to produce excel-
lent 3D detection performance with high computational ef-
ficiency [27]. The generated bird’s-eye-view representation
is used as input to a bird’s-eye-view based detector to pro-
duce the final output.
CaDDN is shown to rank first among all previously pub-
lished monocular methods on the Car and Pedestrian cate-
gories of the KITTI 3D object detection test benchmark [1],
with margins of 1.69% and 1.46% AP|R40 respectively. We
are the first to report monocular 3D object detection results
on the Waymo Open Dataset [59].
2. Related Work
Monocular Depth Estimation. Monocular depth esti-
mation is performed by generating a single depth value
for every pixel in an image. As such, many monocular
depth estimation methods are based on architectures used
in well-studied pixel-to-pixel mapping problems such as se-
mantic segmentation. As an example, fully convolutional
networks (FCNs) [36] were introduced for semantic seg-
mentation, and were subsequently adopted for monocular
depth estimation [26]. The atrous spatial pyramid pooling
(ASPP) module was also first proposed for semantic seg-
mentation in DeepLab [9,8,7] and subsequently used for
depth estimation in DORN [16] and BTS [28]. Further,
many methods jointly perform depth estimation and seg-
mentation [66,69,61,15] in an end-to-end manner. We
follow the design of the semantic segmentation network
DeepLabV3 [7] for estimating categorical depth distribu-
tions for each pixel in the image.
BEV Semantic Segmentation. BEV segmentation meth-
ods [44,51] attempt to predict BEV semantic maps of 3D
scenes from images. Images can be used to either directly
estimate BEV semantic maps [41,37,63] or to estimate a
BEV feature representation [46,49,43] as an intermediate
step for the segmentation task. In particular, Lift, Splat,
Shoot [46] predicts categorical depth distributions in an un-
supervised manner, in order to generate intermediate BEV
representations. In this work, we predict categorical depth
distributions via supervision with ground truth one-hot en-
codings to generate more accurate depth distributions for
object detection.
Monocular 3D Detection. Monocular 3D object detection
methods often generate intermediate representations to as-
sist in the 3D detection task. Based on these representations,
monocular detection can be divided into three categories:
direct, depth-based, and grid-based methods.
Direct Methods. Direct methods [10,54,4,34] estimate
3D detections directly from images without predicting an
intermediate 3D scene representation. Rather, direct meth-
ods [55,13,42,33,3] can incorporate the geometric rela-
tionship between the 2D image plane and 3D space to assist
with detections. For example, object keypoints can be esti-
mated on the image plane, in order to assist in 3D box con-
struction using known geometry [35,30]. M3D-RPN [3]
introduces depth-aware convolutions that divide the input row-wise and learn non-shared kernels for each region, in order to learn location-specific features that correlate with regions in 3D space. Shape estimation can be performed for objects
in the scene to create an understanding of 3D object geom-
etry. Shape estimates can be supervised from labeled ver-
tices of 3D CAD models [5,25], from LiDAR scans [23], or
directly from input data in a self-supervised manner [2]. A
drawback for direct methods is that detections are generated
Figure 2. CaDDN Architecture. The network is composed of three modules to generate 3D feature representations and one to perform 3D detection. Frustum features G are generated from an image I using estimated depth distributions D, which are transformed into voxel features V. The voxel features are collapsed to bird's-eye-view features B to be used for 3D object detection.
directly from 2D images, without access to explicit depth
information, usually resulting in reduced performance in lo-
calization relative to other methods.
Depth-Based Methods. Depth-based methods perform the
3D detection task using pixel-wise depth maps as an addi-
tional input, where the depth maps are precomputed using
monocular depth estimation architectures [16]. Estimated
depth maps can be used in combination with images to per-
form the 3D detection task [40,67,38,14]. Alternatively,
depth maps can be converted to 3D point clouds, commonly
known as Pseudo-LiDAR [62], which are either used di-
rectly [64,6] or combined with image information [65,39]
to generate 3D object detection results. Depth-based meth-
ods separate depth estimation from 3D object detection dur-
ing the training stage, leading to the learning of sub-optimal
depth maps used for the 3D detection task. Accurate depth
should be prioritized for pixels belonging to objects of inter-
est, and is less important for background pixels, a property
that is not captured if depth estimation and object detection
are trained independently.
Grid-Based Methods. Grid-based methods avoid estimating
raw depth values by predicting a BEV grid [50,57] repre-
sentation, to be used as input for 3D detection architectures.
Specifically, OFT [50] populates a voxel grid by projecting
voxels into the image plane and sampling image features,
to be transformed into a BEV representation. Multiple vox-
els can be projected to the same image feature, leading to
repeated features along the projection ray and reduced de-
tection accuracy.
CaDDN addresses all identified issues by jointly per-
forming depth estimation and 3D object detection in an end-
to-end manner, and leverages the depth estimates to gener-
ate meaningful bird’s-eye-view representations with accu-
rate and localized features.
3. Methodology
CaDDN learns to generate BEV representations from
images by projecting image features into 3D space. 3D
object detection is then performed with the rich BEV rep-
resentation using an efficient BEV detection network. An
overview of CaDDN’s architecture is shown in Figure 2.
3.1. 3D Representation Learning
Our network learns to produce BEV representations that
are well-suited for the task of 3D object detection. Tak-
ing an image as input, we construct a frustum feature grid
using the estimated categorical depth distributions. The
frustum feature grid is transformed into a voxel grid using
known camera calibration parameters, and then collapsed to
a bird’s-eye-view feature grid.
Frustum Feature Network. The purpose of the frustum
feature network is to project image information into 3D
space, by associating image features to estimated depths.
Specifically, the input to the frustum feature network is an
image I ∈ R^(W_I × H_I × 3), where W_I, H_I are the width and height of the image. The output is a frustum feature grid G ∈ R^(W_F × H_F × D × C), where W_F, H_F are the width and height of the image feature representation, D is the number of discretized depth bins, and C is the number of feature channels. We note that the structure of the frustum grid is similar to the plane-sweep volume used in the stereo 3D detection method DSGN [12].
Figure 3. Each feature pixel F(u, v) is weighted by its depth distribution probabilities D(u, v) of belonging to D discrete depth bins to generate frustum features G(u, v).
A ResNet-101 [18] backbone is used to extract image features F̃ ∈ R^(W_F × H_F × C) (see Image Backbone in Figure 2). In our implementation, we extract the image fea-
tures from Block1 of the ResNet-101 backbone in order to
maintain a high spatial resolution. A high spatial resolution
is necessary for an effective frustum to voxel grid transfor-
mation, such that the frustum grid can be finely sampled
without repeated features.
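For illustration, the sketch below is a stand-in using torchvision's ResNet-101 (not the authors' released code; a recent torchvision is assumed and pretrained weights are omitted) that extracts the Block1 (layer1) features, which retain roughly 1/4 of the input resolution with 256 channels.

```python
# A minimal sketch of extracting high-resolution Block1 features from ResNet-101.
import torch
import torchvision

backbone = torchvision.models.resnet101(weights=None)

def block1_features(image):
    """image: (B, 3, H_I, W_I) -> F~ of shape (B, 256, ~H_I/4, ~W_I/4)."""
    x = backbone.relu(backbone.bn1(backbone.conv1(image)))  # stride-2 stem
    x = backbone.maxpool(x)                                 # stride-2 max pooling
    return backbone.layer1(x)                               # Block1 keeps this resolution

F_tilde = block1_features(torch.randn(1, 3, 375, 1242))     # a KITTI-sized image
print(F_tilde.shape)                                        # torch.Size([1, 256, 94, 311])
```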
The image features F̃ are used to estimate pixel-wise categorical depth distributions D ∈ R^(W_F × H_F × D), where the categories are the D discretized depth bins. Specifically, we predict D probabilities for each pixel in the image features F̃, where each probability indicates the network's confidence that the depth value belongs to a specified depth bin. The definition of the depth bins relies on the depth discretization method discussed in Section 3.3.
We follow the design of the semantic segmentation network DeepLabV3 [7] to estimate the categorical depth distributions from image features F̃ (Depth Distribution Network in Figure 2), where we modify the network to produce pixel-wise probability scores of belonging to depth bins rather than semantic classes, using a downsample-upsample architecture. Image features F̃ are downsampled with the remaining components of the ResNet-101 [18] backbone (Block2, Block3, and Block4). An atrous spatial pyramid pooling [7] (ASPP) module is applied to capture multi-scale information, where the number of output channels is set as D. The output of the ASPP module is upsampled to the original feature size with bilinear interpolation to produce the categorical depth distributions D ∈ R^(W_F × H_F × D). A softmax function is applied for each pixel to normalize the D logits into probabilities between 0 and 1.
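The sketch below is a simplified stand-in for this flow (not the exact DeepLabV3 head; the channel counts, dilation rates, and sizes are assumptions), showing parallel dilated convolutions producing D logit channels, bilinear upsampling, and per-pixel softmax normalization.

```python
# A toy ASPP-style depth distribution head: dilated convolutions produce D logit
# channels, which are upsampled bilinearly and softmax-normalized per pixel.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyDepthDistributionHead(nn.Module):
    def __init__(self, in_channels=256, num_bins=80):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_channels, num_bins, kernel_size=3, padding=d, dilation=d)
            for d in (1, 6, 12, 18)          # multi-scale context, ASPP-style
        ])
        self.fuse = nn.Conv2d(4 * num_bins, num_bins, kernel_size=1)

    def forward(self, feats, out_size):
        logits = self.fuse(torch.cat([b(feats) for b in self.branches], dim=1))
        logits = F.interpolate(logits, size=out_size, mode="bilinear", align_corners=False)
        return torch.softmax(logits, dim=1)  # per-pixel categorical depth distributions

feats = torch.randn(1, 256, 24, 78)          # downsampled image features (toy size)
D_dist = ToyDepthDistributionHead()(feats, out_size=(94, 311))
print(D_dist.shape)                          # torch.Size([1, 80, 94, 311])
```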
Figure 4. Sampling points in each voxel are projected into the frus-
tum grid. Frustum features are sampled using trilinear interpola-
tion (shown as blue in G) to populate voxels in V.
In parallel to estimating depth distributions, we perform
channel reduction (Image Channel Reduce in Figure 2) on
image features F̃ to generate the final image features F, using a 1x1 convolution + BatchNorm + ReLU layer to reduce the number of channels from C = 256 to C = 64. Channel
reduction is required to reduce the high memory footprint of
ResNet-101 features that will be populated in the 3D frus-
tum grid.
Let (u, v, c) represent a coordinate in image features F and (u, v, d_i) represent a coordinate in categorical depth distributions D, where (u, v) is the feature pixel location, c is the channel index, and d_i is the depth bin index. To generate a frustum feature grid G, each feature pixel F(u, v) is weighted by its associated depth bin probabilities in D(u, v) to populate the depth axis d_i, visualized in Figure 3. Feature pixels can be weighted by depth probability using the outer product, defined as:
G(u, v) = D(u, v) ⊗ F(u, v)    (1)
where D(u, v) is the predicted depth distribution and G(u, v) is an output matrix of size D × C. The outer product in Equation 1 is computed for each pixel to form frustum features G ∈ R^(W_F × H_F × D × C).
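As a concrete illustration of Equation 1, the following sketch (toy sizes, not the released CaDDN code) forms the frustum features with a per-pixel outer product.

```python
# A minimal sketch of Eq. (1): each feature pixel F(u, v) is weighted by its depth-bin
# probabilities D(u, v) via an outer product, giving G of shape (W_F, H_F, D, C).
import torch

W_F, H_F, D_bins, C = 80, 24, 16, 8                 # toy sizes
F_img = torch.randn(W_F, H_F, C)                    # reduced image features F
D_prob = torch.softmax(torch.randn(W_F, H_F, D_bins), dim=-1)  # depth distributions D

G = torch.einsum("whd,whc->whdc", D_prob, F_img)    # per-pixel outer product
G_alt = D_prob.unsqueeze(-1) * F_img.unsqueeze(-2)  # equivalent broadcasting form
assert torch.allclose(G, G_alt)
print(G.shape)                                      # torch.Size([80, 24, 16, 8])
```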
Frustum to Voxel Transformation. The frustum features G ∈ R^(W_F × H_F × D × C) are transformed to a voxel representation V ∈ R^(X × Y × Z × C) leveraging known camera calibration and differentiable sampling, shown in Figure 4. Voxel sampling points s_k^v = [x, y, z]_k^T are generated at the center of each voxel and transformed to the frustum grid to form frustum sampling points s̃_k^f = [u, v, d_c]_k^T, where d_c is the continuous depth value along the frustum depth axis d_i. The transformation is performed using the camera calibration matrix P ∈ R^(3×4). Each continuous depth value d_c is converted to a discrete depth bin index d_i using the depth discretization method outlined in Section 3.3. Frustum features in G are sampled using sampling points s_k^f = [u, v, d_i]_k^T with trilinear interpolation (shown in blue in Figure 4) to populate voxel features in V.
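A minimal sketch of this transform is given below, under loud assumptions: toy grid sizes, a made-up projection matrix P, and random camera-frame points standing in for real voxel centers. It projects points into the frustum, maps depths to continuous LID bin coordinates (the inverse of Equation 2), and gathers frustum features with trilinear interpolation via grid_sample.

```python
# A minimal sketch of the frustum-to-voxel transform using F.grid_sample.
import torch
import torch.nn.functional as F

C, D_bins, H_F, W_F = 8, 16, 24, 80          # frustum grid sizes (toy)
X, Y, Z = 40, 50, 8                          # voxel grid sizes (toy)
d_min, d_max = 2.0, 46.8

G = torch.randn(1, C, D_bins, H_F, W_F)      # frustum features (N, C, D, H_F, W_F)
P = torch.tensor([[700.0, 0.0, 40.0, 0.0],   # assumed 3x4 camera projection matrix
                  [0.0, 700.0, 12.0, 0.0],
                  [0.0, 0.0, 1.0, 0.0]])

# Stand-in voxel centers in camera coordinates (x right, y down, z forward), homogeneous.
N = X * Y * Z
pts = torch.stack([(torch.rand(N) - 0.5) * 40.0,
                   (torch.rand(N) - 0.5) * 4.0,
                   d_min + torch.rand(N) * (d_max - d_min),
                   torch.ones(N)], dim=1)

proj = pts @ P.t()                           # [u*d, v*d, d] for each point
u, v, d_c = proj[:, 0] / proj[:, 2], proj[:, 1] / proj[:, 2], proj[:, 2]

# Continuous depth -> continuous LID bin coordinate (inverse of Eq. (2)).
bin_size = (d_max - d_min) / (D_bins * (D_bins + 1))
d_i = -0.5 + 0.5 * torch.sqrt(1 + 4 * (d_c - d_min).clamp(min=0) / bin_size)

# Normalize (u, v, d_i) to [-1, 1]; grid_sample expects (x=W_F, y=H_F, z=D) ordering.
grid = torch.stack([2 * u / (W_F - 1) - 1,
                    2 * v / (H_F - 1) - 1,
                    2 * d_i / (D_bins - 1) - 1], dim=-1).view(1, Z, Y, X, 3)

V = F.grid_sample(G, grid, mode="bilinear", align_corners=True)  # trilinear for 5D input
print(V.shape)                               # torch.Size([1, 8, 8, 50, 40]) = (N, C, Z, Y, X)
```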
The spatial resolution of the frustum grid G and the voxel grid V should be similar for an effective transformation. A high resolution voxel grid V leads to a high density of sampling points that will oversample a low resolution frustum grid, resulting in a large amount of similar voxel features. Therefore, we extract the features F̃ from Block1 of the ResNet-101 backbone to ensure our frustum grid G is of high spatial resolution.
Voxel Collapse to BEV. The voxel features V ∈ R^(X × Y × Z × C) are collapsed to a single height plane to generate bird's-eye-view features B ∈ R^(X × Y × C). BEV grids greatly reduce the computational overhead while offering similar detection performance to 3D voxel grids [27], motivating their use in our network. We concatenate the vertical axis z of the voxel grid V along the channel dimension c to form a BEV grid B̃ ∈ R^(X × Y × Z·C). The number of channels is reduced using a 1x1 convolution + BatchNorm + ReLU layer (see BEV Channel Reduce in Figure 2), which retrieves the original number of channels C while learning the relative importance of each height slice, resulting in a BEV grid B ∈ R^(X × Y × C).
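A minimal sketch of this collapse (toy sizes, not the authors' implementation) is:

```python
# The vertical axis Z is folded into the channel dimension, and a 1x1 conv + BatchNorm +
# ReLU restores C channels while learning a weighting of the height slices.
import torch
import torch.nn as nn

C, Z, Y, X = 16, 10, 94, 70
V = torch.randn(2, C, Z, Y, X)                  # voxel features (N, C, Z, Y, X)

B_tilde = V.reshape(2, C * Z, Y, X)             # concatenate height slices along channels
bev_channel_reduce = nn.Sequential(             # "BEV Channel Reduce" block
    nn.Conv2d(C * Z, C, kernel_size=1),
    nn.BatchNorm2d(C),
    nn.ReLU(inplace=True),
)
B = bev_channel_reduce(B_tilde)                 # BEV features (N, C, Y, X)
print(B.shape)                                  # torch.Size([2, 16, 94, 70])
```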
3.2. BEV 3D Object Detection
To perform 3D object detection on the BEV feature grid,
we adopt the backbone and detection head of the well-
established BEV 3D object detector PointPillars [27], as
it has been shown to provide accurate 3D detection results
with a low computational overhead. For the BEV backbone,
we increase the number of 3x3 convolution + BatchNorm +
ReLU layers in the downsample blocks from (4, 6, 6) used
in the original PointPillars [27] to (10, 10, 10) for Block1,
Block2, and Block3 respectively. Increasing the number of
convolutional layers expands the learning capacity in our
BEV network, important for learning from lower quality
features produced by images compared to higher quality
features originally produced by LiDAR point clouds. We
use the same detection head as PointPillars [27] to generate
our final detections.
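One possible reading of this change is sketched below (the channel counts and strides are assumptions; this is not the PointPillars code): each downsample block stacks 10 repeated 3x3 conv + BatchNorm + ReLU layers.

```python
# A toy BEV backbone downsample block with the increased layer count described above.
import torch.nn as nn

def bev_block(in_ch, out_ch, stride, num_layers=10):
    layers = [nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
              nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True)]
    for _ in range(num_layers - 1):              # remaining 3x3 conv + BN + ReLU layers
        layers += [nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
                   nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

block1 = bev_block(64, 64, stride=1)             # Block1: 10 layers instead of 4
block2 = bev_block(64, 128, stride=2)            # Block2: 10 layers instead of 6
block3 = bev_block(128, 256, stride=2)           # Block3: 10 layers instead of 6
```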
3.3. Depth Discretization
The continuous depth space is discretized in order to define the set of D bins used in the depth distributions D. Depth discretization can be performed with uniform discretization (UD) with a fixed bin size, spacing-increasing discretization (SID) [16] with increasing bin sizes in log space, or linear-increasing discretization (LID) [60] with linearly increasing bin sizes. Depth discretization techniques are visualized in Figure 5. We adopt LID as our depth discretization as it provides balanced depth estimation for all depths [60]. LID is defined as:
d_c = d_min + ((d_max − d_min) / (D(D + 1))) · d_i(d_i + 1)    (2)
where d_c is the continuous depth value, [d_min, d_max] is the full depth range to be discretized, D is the number of depth bins, and d_i is the depth bin index.
Figure 5. Depth Discretization Methods. Depth d_c is discretized over a depth range [d_min, d_max] into N discrete bins. Commonly used methods include uniform (UD), spacing-increasing (SID), and linear-increasing (LID) discretization.
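The sketch below illustrates LID and its inverse (d_min and d_max match the KITTI range given in Section 4; D = 80 bins is an assumption, not a value quoted from the paper).

```python
# A minimal sketch of LID depth discretization (Eq. (2)) and its inverse.
import torch

d_min, d_max, D = 2.0, 46.8, 80
bin_size = (d_max - d_min) / (D * (D + 1))

def lid_edge(d_i):
    """Continuous bin coordinate d_i in [0, D] -> depth d_c in [d_min, d_max] (Eq. (2))."""
    return d_min + bin_size * d_i * (d_i + 1)

def lid_index(d_c):
    """Depth d_c -> continuous bin coordinate; floor() gives the discrete bin index."""
    return -0.5 + 0.5 * torch.sqrt(1 + 4 * (d_c - d_min).clamp(min=0) / bin_size)

depths = torch.tensor([2.0, 10.0, 25.0, 46.8])
print(lid_index(depths).floor().clamp(max=D - 1).long())  # discrete LID bin indices
print(lid_edge(torch.arange(D + 1.0)))                    # D + 1 edges with growing spacing
```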
3.4. Depth Distribution Label Generation
We require depth distribution labels D̂ in order to supervise our predicted depth distributions. Depth distribution labels are generated by projecting LiDAR point clouds into the image frame to create sparse depth maps. Depth completion [21] is performed to generate depth values at each pixel in the image. We require depth information at each image feature pixel, so we downsample the depth maps of size W_I × H_I to the image feature size W_F × H_F. The depth maps are converted to bin indices using the LID discretization method described in Section 3.3, followed by a conversion into a one-hot encoding to generate the depth distribution labels. A one-hot encoding ensures the depth distribution labels are sharp, which is essential to encourage sharpness in our depth distribution predictions via supervision.
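A minimal sketch of this label generation is shown below (a random tensor stands in for the completed depth map, the downsampling operator and shapes are assumptions, and this is not the authors' pipeline).

```python
# Build depth distribution labels: downsample a completed depth map to the feature
# resolution, convert to LID bin indices, and one-hot encode over the D depth bins.
import torch
import torch.nn.functional as F

d_min, d_max, D = 2.0, 46.8, 80
H_I, W_I, H_F, W_F = 375, 1242, 94, 311

depth_map = d_min + torch.rand(1, 1, H_I, W_I) * (d_max - d_min)       # stand-in depth map
depth_map = F.interpolate(depth_map, size=(H_F, W_F), mode="nearest")  # to feature size

bin_size = (d_max - d_min) / (D * (D + 1))
idx = -0.5 + 0.5 * torch.sqrt(1 + 4 * (depth_map - d_min) / bin_size)
idx = idx.floor().clamp(0, D - 1).long()                               # LID bin index per pixel

labels = F.one_hot(idx.squeeze(1), num_classes=D)                      # (1, H_F, W_F, D)
print(labels.shape)
```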
3.5. Training Losses
Generally, classification is performed by predicting categorical distributions, and encouraging sharpness in the distribution in order to select the correct class [20]. We leverage classification to encourage a single correct depth bin when supervising the depth distribution network, using the focal loss [32]:
L_depth = (1 / (W_F · H_F)) Σ_{u=1}^{W_F} Σ_{v=1}^{H_F} FL(D(u, v), D̂(u, v))    (3)
where D is the depth distribution predictions and D̂ is the depth distribution labels.
Car (IOU = 0.7) Pedestrian (IOU = 0.5) Cyclist (IOU = 0.5)
Method Frames Easy Mod. Hard Easy Mod. Hard Easy Mod. Hard
Kinematic3D [4] 4 19.07 12.72 9.17 – – – – – –
OFT [50] 1 1.61 1.32 1.00 0.63 0.36 0.35 0.14 0.06 0.07
ROI-10D [40] 1 4.32 2.02 1.46 – – – – – –
MonoPSR [23] 1 10.76 7.25 5.85 6.12 4.00 3.30 8.37 4.74 3.68
Mono3D-PLiDAR [64] 1 10.76 7.50 6.10 – – – – – –
MonoDIS [54] 1 10.37 7.94 6.40 – – – – – –
UR3D [67] 1 15.58 8.61 6.00 – – – – – –
M3D-RPN [3] 1 14.76 9.71 7.42 4.92 3.48 2.94 0.94 0.65 0.47
SMOKE [35] 1 14.03 9.76 7.84 – – – – – –
MonoPair [13] 1 13.04 9.99 8.65 10.02 6.68 5.53 3.79 2.12 1.83
RTM3D [30] 1 14.41 10.34 8.77 – – – – – –
AM3D [39] 1 16.50 10.74 9.52 – – – – – –
MoVi-3D [55] 1 15.19 10.90 9.26 8.99 5.44 4.57 1.08 0.63 0.70
RAR-Net [34] 1 16.37 11.01 9.52 – – – – – –
PatchNet [38] 1 15.68 11.12 10.17 – – – – – –
DA-3Ddet [6] 1 16.77 11.50 8.93 – – – – – –
D4LCN [14] 1 16.65 11.72 9.51 4.55 3.42 2.83 2.45 1.67 1.36
CaDDN (ours) 1 19.17 13.41 11.46 12.87 8.14 6.76 7.00 3.41 3.30
Improvement – +2.40 +1.69 +1.29 +2.85 +1.46 +1.23 -1.37 -1.33 -0.38
Table 1. Comparative Results on the KITTI [17] test set. Results are shown using the AP|R40 metric only for results that are readily available. We indicate the highest result with red and the second highest with blue.
We found that autonomous driving datasets contain images with fewer object pixels than background pixels, leading to loss functions that prioritize background pixels when all pixel losses are weighted evenly. We set the focal loss [32] weighting factor α as α_fg = 3.25 for foreground object pixels and α_bg = 0.25 for background pixels. Foreground object pixels are determined as all pixels that lie within 2D object bounding box labels, and background pixels are all remaining pixels. We set the focal loss [32] focusing parameter γ = 2.0.
We use the classification loss L_cls, regression loss L_reg, and direction classification loss L_dir from PointPillars [27] for 3D object detection. The total loss of our network is the combination of the depth and 3D detection losses:
L = λ_depth L_depth + λ_cls L_cls + λ_reg L_reg + λ_dir L_dir    (4)
where λ_depth, λ_cls, λ_reg, λ_dir are fixed loss weighting factors.
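One way to realize the depth term of Equation 3 with the foreground/background weighting of Section 3.5, combined as in Equation 4, is sketched below (toy tensors; the detection losses are placeholders rather than the PointPillars implementation, and this is not the exact training code).

```python
# Focal loss over predicted depth-bin probabilities with a larger alpha for foreground
# pixels, combined with placeholder detection losses using the weights from Section 4.
import torch

def depth_focal_loss(depth_prob, target_bins, fg_mask, alpha_fg=3.25, alpha_bg=0.25, gamma=2.0):
    """depth_prob: (B, D, H_F, W_F) softmax outputs; target_bins: (B, H_F, W_F) bin indices;
    fg_mask: (B, H_F, W_F) True for pixels inside 2D object boxes."""
    p_t = depth_prob.gather(1, target_bins.unsqueeze(1)).squeeze(1).clamp(min=1e-6)
    alpha = torch.where(fg_mask, torch.full_like(p_t, alpha_fg), torch.full_like(p_t, alpha_bg))
    loss = -alpha * (1 - p_t) ** gamma * torch.log(p_t)
    return loss.mean()

B, D, H_F, W_F = 2, 80, 24, 80
depth_prob = torch.softmax(torch.randn(B, D, H_F, W_F), dim=1)
target_bins = torch.randint(0, D, (B, H_F, W_F))
fg_mask = torch.rand(B, H_F, W_F) > 0.9

L_depth = depth_focal_loss(depth_prob, target_bins, fg_mask)
L_cls, L_reg, L_dir = [torch.tensor(0.0)] * 3      # placeholders for the PointPillars losses
L_total = 3.0 * L_depth + 1.0 * L_cls + 2.0 * L_reg + 0.2 * L_dir
print(L_total)
```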
4. Experimental Results
To demonstrate the effectiveness of CaDDN we present
results on both the KITTI 3D object detection bench-
mark [17] and the Waymo Open Dataset [59].
The KITTI 3D object detection benchmark [17] is di-
vided into 7,481 training samples and 7,518 testing sam-
ples. The training samples are commonly divided into a
train set (3,712 samples) and a val set (3,769 samples)
following [11], which is also adopted here. We compare
CaDDN with existing methods on the test set by training
our model on both the train and val sets. We evaluate on the
val set for ablation by training our model on only the train
set.
The Waymo Open Dataset [59] is a more recently re-
leased autonomous driving dataset, which consists of 798
training sequences and 202 validation sequences. The
dataset also includes 150 test sequences without ground
truth data. The dataset provides object labels in the full 360° field of view with a multi-camera rig. We only use the front camera and only consider object labels in the front camera's field of view (50.4°) for the task of monocular
object detection, and provide results on the validation se-
quences. We sample every 3rd frame from the training se-
quences to form our training set (51,564 samples) due to the
large dataset size and high frame rate.
Input Parameters. The voxel grid is defined by a range and voxel size in 3D space. On KITTI [17], we use [2, 46.8] × [−30.08, 30.08] × [−3, 1] (m) for the range and [0.16, 0.16, 0.16] (m) for the voxel size for the x, y, and z axes respectively. On Waymo, we use [2, 55.76] × [−25.6, 25.6] × [−4, 4] (m) for the range and [0.16, 0.16, 0.16] (m) for the voxel size. Additionally, we downsample Waymo images to 1248 × 832.
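As a quick check of the grid sizes implied by the KITTI numbers above (simple arithmetic, not values quoted from the paper):

```python
# BEV grid dimensions implied by the KITTI range and 0.16 m voxel size.
x_range, y_range, z_range = (2.0, 46.8), (-30.08, 30.08), (-3.0, 1.0)
voxel = 0.16
nx = round((x_range[1] - x_range[0]) / voxel)   # 280
ny = round((y_range[1] - y_range[0]) / voxel)   # 376
nz = round((z_range[1] - z_range[0]) / voxel)   # 25
print(nx, ny, nz)  # 280 376 25 -> a 280 x 376 BEV grid with 25 height slices
```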
Training and Inference Details. Our method is imple-
mented in PyTorch [45]. The network is trained on a
NVIDIA Tesla V100 (32G) GPU. The Adam [19] optimizer
is used with an initial learning rate of 0.001 and is modi-
fied using the one-cycle learning rate policy [56]. We train
the model for 80 epochs on the KITTI dataset [17] and 10
epochs on the Waymo Open Dataset [59]. We use a batch
size of 4 for KITTI [17] and a batch size of 2 for Waymo.
The values λ_depth = 3.0, λ_cls = 1.0, λ_reg = 2.0, and λ_dir = 0.2 are used for the loss weighting factors in Equation 4.
Difficulty / Method | 3D mAP (Overall, 0 - 30m, 30 - 50m, 50m - ∞) | 3D mAPH (Overall, 0 - 30m, 30 - 50m, 50m - ∞)
LEVEL 1 (IOU = 0.7)
M3D-RPN [3] | 0.35 1.12 0.18 0.02 | 0.34 1.10 0.18 0.02
CaDDN (Ours) | 5.03 14.54 1.47 0.10 | 4.99 14.43 1.45 0.10
Improvement | +4.69 +13.43 +1.28 +0.08 | +4.65 +13.33 +1.28 +0.08
LEVEL 2 (IOU = 0.7)
M3D-RPN [3] | 0.33 1.12 0.18 0.02 | 0.33 1.10 0.17 0.02
CaDDN (Ours) | 4.49 14.50 1.42 0.09 | 4.45 14.38 1.41 0.09
Improvement | +4.15 +13.38 +1.24 +0.07 | +4.12 +13.28 +1.24 +0.07
LEVEL 1 (IOU = 0.5)
M3D-RPN [3] | 3.79 11.14 2.16 0.26 | 3.63 10.70 2.09 0.21
CaDDN (Ours) | 17.54 45.00 9.24 0.64 | 17.31 44.46 9.11 0.62
Improvement | +13.76 +33.86 +7.08 +0.39 | +13.69 +33.77 +7.02 +0.41
LEVEL 2 (IOU = 0.5)
M3D-RPN [3] | 3.61 11.12 2.12 0.24 | 3.46 10.67 2.04 0.20
CaDDN (Ours) | 16.51 44.87 8.99 0.58 | 16.28 44.33 8.86 0.55
Improvement | +12.89 +33.75 +6.87 +0.34 | +12.82 +33.66 +6.81 +0.36
Table 2. Results on the Waymo Open Dataset Validation Set on the Vehicle class. We implement an M3D-RPN [3] baseline for comparison.
We
employ horizontal flip as our data augmentation and train
one model for all classes. During inference, we filter boxes
with a score threshold of 0.1 and apply non-maximum sup-
pression (NMS) with an IoU threshold of 0.01.
4.1. KITTI Dataset Results
Results on the KITTI dataset [17] are evaluated using
average precision (AP|R40). The evaluation is separated by
difficulty settings (Easy, Moderate, and Hard) and by object
class (Car, Pedestrian, and Cyclist). The Car class has an IoU criterion of 0.7 while the Pedestrian and Cyclist classes have an IoU criterion of 0.5, where the IoU criterion is the threshold for a detection to be considered a true positive.
Table 1 shows the results of CaDDN on the KITTI [17] test set compared to state-of-the-art published monocular methods, listed in rank order of performance on the Car class at the Moderate difficulty setting. We note that our method outperforms previous single-frame methods by large margins on AP|R40 of +2.40%, +1.69%, and +1.29% on the Car class on the Easy, Moderate, and Hard difficulties respectively. Additionally, CaDDN ranks higher than the multi-frame method Kinematic3D [4]. Our method also outperforms the previous state-of-the-art method on the Pedestrian class, MonoPair [13], with margins on AP|R40 of +2.85%, +1.46%, and +1.23%. Our method achieves second place on the Cyclist class with margins on AP|R40 of -1.37%, -1.33%, and -0.38% relative to MonoPSR [23].
4.2. Waymo Dataset Results
We adopt the officially released evaluation to calculate
the mean average precision (mAP) and the mean average
precision weighted by heading (mAPH) on the Waymo
Open Dataset [59]. The evaluation is separated by difficulty
setting (LEVEL 1, LEVEL 2) and distance to the sensor (0
- 30m, 30 - 50m, and 50m-∞). We evaluate on the Vehicle
class with an IoU criteria of 0.7 and 0.5.
Exp. | D | L_depth | α_fg | LID | Car (IOU = 0.7) Easy / Mod. / Hard
1 | | | | | 7.83 / 5.66 / 4.84
2 | ✓ | | | | 9.33 / 6.43 / 5.30
3 | ✓ | ✓ | | | 19.73 / 14.03 / 11.84
4 | ✓ | ✓ | ✓ | | 20.40 / 15.10 / 12.75
5 | ✓ | ✓ | ✓ | ✓ | 23.57 / 16.31 / 13.84
Table 3. CaDDN Ablation Experiments on the KITTI val set using AP|R40. D indicates depth distribution prediction, L_depth indicates depth distribution supervision. α_fg indicates separate setting of the loss weighting factor for foreground object pixels in the depth loss function L_depth. LID indicates the LID discretization method.
Exp. | D | D ⊗ F | Car (IOU = 0.7) Easy / Mod. / Hard
1 | BTS [28] | Single | 16.69 / 10.18 / 8.63
2 | DORN [16] | Single | 16.43 / 11.04 / 9.65
3 | CaDDN (Ours) | Single | 20.61 / 13.71 / 11.96
4 | CaDDN (Ours) | Full | 23.57 / 16.31 / 13.84
Table 4. CaDDN Depth Estimation Ablation on the KITTI val set using the AP|R40. D indicates the source of the depth estimates used to generate depth distributions D. D ⊗ F indicates whether a single bin or the full distribution is used to generate frustum features G.
To the best of our knowledge, no monocular methods
have reported results on the Waymo dataset. Therefore,
we implement and evaluate M3D-RPN [3] on the Waymo
dataset as a comparison point. Table 2 shows the results of
both the M3D-RPN [3] baseline and CaDDN on the Waymo
validation set. Our method significantly outperforms M3D-
RPN [3] with margins on AP/APH of +4.69%/+4.65% and
+4.15%/+4.12% on the LEVEL 1 and LEVEL 2 difficulties
respectively for an IoU criteria of 0.7.
4.3. Ablation Studies
We provide ablation studies on individual components of
our network to validate our design choices. The results are
shown in Tables 3 and 4.
Sharpness in Depth Distributions. Experiment 1 in Table 3 shows the detection performance when frustum features G are populated by repeating image features F along the depth axis d_i. Experiment 2 adds depth distribution predictions D to separately weigh image features F, which improves performance on AP|R40 by +1.50%, +0.77%, and +0.46% on the Car class on the Easy, Moderate, and Hard difficulties respectively. Performance is greatly increased (+10.40%, +7.60%, +6.54%) once depth distribution supervision is added in Experiment 3, validating its inclusion. The addition of depth distribution supervision encourages sharp and accurate categorical depth distributions, which encourages image information to be located in 3D space where depth estimation is both accurate and confident. Encouraging sharpness around correct depth bins results in object features that are uniquely located and easily distinguished in the BEV projection (see Figure 1).
Object Weighting for Depth Distribution Estimation. All experiments in Table 3 up to Experiment 3 use a fixed loss weighting factor α = 0.25 for all pixels in the depth loss function L_depth. Experiment 4 shows an improvement (+0.67%, +1.07%, +0.91%) after the separation of depth loss weighting α_fg = 3.25 / α_bg = 0.25 for foreground object and background pixels as described in Section 3.5. Setting a larger foreground object weighting factor α_fg encourages depth estimation to be prioritized for foreground object pixels, leading to more accurate depth estimation and localization for objects.
Linear Increasing Discretization. Experiment 5 in Table 3 shows the detection performance improvement (+3.17%,
+1.21%, +1.09%) when LID (see Section 3.3) is used rather
than uniform discretization. We attribute the performance
increase to the accurate depth estimation LID provides
across all depths [60].
Joint Depth Understanding. Experiments 1 and 2 in Table 4 show the detection performance with separate depth
estimation from BTS [28] and DORN [16] respectively. The
depth maps are converted to depth bin indices using LID
discretization as outlined in Section 3.3, and converted to
a one-hot encoding to generate the categorical depth dis-
tributions D. The one-hot encoding places the image fea-
ture at only a single depth bin indicated by the input depth
map. Experiment 3 shows improved performance (+4.18%,
+2.67%, +2.31%) when depth estimation and object detec-
tion are performed jointly, which we attribute to the well-
known benefits of multi-task losses and end-to-end learning
for 3D detection.
Categorical Depth Distributions. Experiment 3 in Table 4 uses only a single depth bin when generating frustum features G, by selecting the bin with the highest estimated probability for each pixel. Experiment 4 in Table 4 uses the full depth distribution D in the frustum features computation G = D ⊗ F, leading to a clear increase in performance (+2.96%, +2.60%, +1.88%). We attribute the performance increase to the additional depth uncertainty information embedded in the feature representations.
Figure 6. We plot the entropy of the estimated depth distributions D against depth. We show both the mean (solid line) and 95% confidence interval (shaded region) at each ground truth depth bin.
4.4. Depth Distribution Uncertainty
To validate that our depth distributions contain meaning-
ful uncertainty information, we compute the Shannon entropy for each estimated categorical depth distribution in
D. We label each distribution with its associated ground
truth depth bin and foreground/background classification.
For each group, we compute the entropy statistics which
are shown in Figure 6. We observe that entropy generally
increases as a function of depth, where depth estimates are
challenging, indicating our distributions describe meaning-
ful uncertainty information. Our network produces the low-
est distribution entropy at pixels with ground truth depth of
around 6 meters. We attribute the high entropy at depths
closer than 6 meters to the small number of pixels at shorter
ranges in the training set. Finally, we note that the fore-
ground depth distribution estimates have slightly higher en-
tropy than background pixels, a phenomenon that can also
be attributed to training set imbalance.
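A minimal sketch of this analysis is shown below (toy tensors; in practice the network's predicted distributions and the projected LiDAR ground truth bins would be used), computing the per-pixel Shannon entropy and grouping it by ground truth depth bin.

```python
# Per-pixel Shannon entropy of predicted categorical depth distributions, grouped by
# ground truth depth bin.
import torch

B, D, H_F, W_F = 1, 80, 24, 80
depth_prob = torch.softmax(torch.randn(B, D, H_F, W_F), dim=1)       # predicted distributions
gt_bins = torch.randint(0, D, (B, H_F, W_F))                         # toy ground truth bins

entropy = -(depth_prob * torch.log(depth_prob.clamp(min=1e-12))).sum(dim=1)  # (B, H_F, W_F)
mean_per_bin = torch.stack([entropy[gt_bins == b].mean() for b in range(D)])
print(entropy.mean(), mean_per_bin.shape)
```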
5. Conclusion
We have presented CaDDN, a novel monocular 3D
object detection method that estimates accurate cate-
gorical depth distributions for each pixel. The depth
distributions are combined with the image features
to generate bird’s-eye-view representations that retain
depth confidence, to be exploited for 3D object detec-
tion. We have shown that estimating sharp categorical
distributions centered around the correct depth value,
and jointly performing depth estimation and object de-
tection is vital for 3D object detection performance,
leading to a 1st place ranking on the KITTI dataset [1]
among all published methods at the time of submission.
References
[1] Kitti's 3d object detection evaluation benchmark 2017. http://www.cvlibs.net/datasets/kitti/eval_object.php?obj_benchmark=3d. Accessed on 15.11.2020. 2, 8
[2] Deniz Beker, Hiroharu Kato, Mihai Adrian Morariu,
Takahiro Ando, Toru Matsuoka, Wadim Kehl, and Adrien
Gaidon. Monocular differentiable rendering for self-
supervised 3d object detection. ECCV, 2020. 2
[3] Garrick Brazil and Xiaoming Liu. M3D-RPN: monocular 3D
region proposal network for object detection. ICCV, 2019. 2,
6,7
[4] Garrick Brazil, Gerard Pons-Moll, Xiaoming Liu, and Bernt
Schiele. Kinematic 3d object detection in monocular video.
ECCV, 2020. 2,6,7
[5] Florian Chabot, Mohamed Chaouch, Jaonary Rabarisoa, Céline Teulière, and Thierry Chateau. Deep MANTA: A coarse-to-fine many-task network for joint 2d and 3D vehicle analysis from monocular image. CVPR, 2017. 1, 2
[6] Lele Chen, Guofeng Cui, Celong Liu, Zhong Li, Ziyi Kou,
Yi Xu, and Chenliang Xu. Monocular 3d object detection via
feature domain adaptation. ECCV, 2020. 3,6
[7] Liang-Chieh Chen, George Papandreou, Florian Schroff, and
Hartwig Adam. Rethinking atrous convolution for semantic
image segmentation. arXiv preprint, 2017. 2,4
[8] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos,
Kevin Murphy, and Alan L. Yuille. Semantic image segmen-
tation with deep convolutional nets and fully connected crfs.
ICLR, 2015. 2
[9] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos,
Kevin Murphy, and Alan L. Yuille. Deeplab: Semantic im-
age segmentation with deep convolutional nets, atrous con-
volution, and fully connected crfs. arXiv preprint, 2016. 2
[10] Xiaozhi Chen, Kaustav Kundu, Ziyu Zhang, Huimin Ma1,
Sanja Fidler, and Raquel Urtasun. Monocular 3d object de-
tection for autonomous driving. CVPR, 2016. 1,2
[11] Xiaozhi Chen, Kaustav Kundu, Yukun Zhu, Andrew G
Berneshawi, Huimin Ma, Sanja Fidler, and Raquel Urtasun.
3d object proposals for accurate object class detection. NIPS,
2015. 6
[12] Yilun Chen, Shu Liu, Xiaoyong Shen, and Jiaya Jia. Dsgn:
Deep stereo geometry network for 3d object detection.
CVPR, 2020. 1,4
[13] Yongjian Chen, Lei Tai, Kai Sun, and Mingyang Li.
Monopair: Monocular 3d object detection using pairwise
spatial relationships. CVPR, 2020. 2,6,7
[14] Mingyu Ding, Yuqi Huo, Hongwei Yi, Zhe Wang, Jianping
Shi, Zhiwu Lu, and Ping Luo. Learning depth-guided convo-
lutions for monocular 3d object detection. CVPR, 2020. 1,
3,6
[15] David Eigen and Rob Fergus. Predicting depth, surface nor-
mals and semantic labels with a common multi-scale convo-
lutional architecture. ICCV, 2015. 2
[16] Huan Fu, Mingming Gong, Chaohui Wang, Kayhan Bat-
manghelich, and Dacheng Tao. Deep ordinal regression net-
work for monocular depth estimation. CVPR, 2018. 2,3,5,
7,8
[17] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we
ready for autonomous driving? the kitti vision benchmark
suite. CVPR, 2012. 1,6,7
[18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.
Deep residual learning for image recognition. CVPR, 2016.
4
[19] Diederik P Kingma and Jimmy Ba. Adam: A method for
stochastic optimization. ICLR, 2015. 6
[20] S. B. Kotsiantis. Supervised machine learning: A review of
classification techniques. In Proceedings of the 2007 Con-
ference on Emerging Artificial Intelligence Applications in
Computer Engineering: Real Word AI Systems with Applica-
tions in EHealth, HCI, Information Retrieval and Pervasive
Technologies, NLD, 2007. IOS Press. 5
[21] Jason Ku, Ali Harakeh, and Steven Lake Waslander. In de-
fense of classical image processing: Fast depth completion
on the CPU. CRV, 2018. 5
[22] Jason Ku, Melissa Mozifian, Jungwook Lee, Ali Harakeh,
and Steven Lake Waslander. Joint 3D proposal generation
and object detection from view aggregation. IROS, 2018. 1
[23] Jason Ku, Alex D. Pon, and Steven L. Waslander. Monocular
3D object detection leveraging accurate proposals and shape
reconstruction. CVPR, 2019. 1,2,6,7
[24] Volodymyr Kuleshov, Nathan Fenner, and Stefano Ermon.
Accurate uncertainties for deep learning using calibrated re-
gression. arXiv preprint arXiv:1807.00263, 2018. 2
[25] Abhijit Kundu, Yin Li, and James M. Rehg. 3D-RCNN:
Instance-level 3D object reconstruction via render-and-
compare. CVPR, 2018. 2
[26] Iro Laina, Christian Rupprecht, Vasileios Belagiannis, Fed-
erico Tombari, and Nassir Navab. Deeper depth prediction
with fully convolutional residual networks. 3DV, 2016. 2
[27] Alex H. Lang, Sourabh Vora, Holger Caesar, Lubing Zhou,
Jiong Yang, and Oscar Beijbom. PointPillars: Fast encoders
for object detection from point clouds. CVPR, 2019. 2,5,6
[28] Jin Han Lee, Myung-Kyu Han, Dong Wook Ko, and Il Hong
Suh. From big to small: Multi-scale local planar guidance
for monocular depth estimation. arXiv preprint, 2019. 2,7,
8
[29] Chengyao Li, Jason Ku, and Steven L. Waslander. Confi-
dence guided stereo 3d object detection with split depth esti-
mation. IROS, 2020. 1
[30] Peixuan Li, Huaici Zhao, Pengfei Liu, and Feidao Cao.
Rtm3d: Real-time monocular 3d detection from object key-
points for autonomous driving. ECCV, 2020. 2,6
[31] Ming Liang, Bin Yang, Yun Chen, Rui Hu, and Raquel Urta-
sun. Multi-task multi-sensor fusion for 3D object detection.
CVPR, 2019. 1
[32] Tsung-Yi Lin, Priya Goyal, Ross B. Girshick, Kaiming He,
and Piotr Dollár. Focal loss for dense object detection. PAMI,
2018. 5,6
[33] Lijie Liu, Jiwen Lu, Chunjing Xu, Qi Tian, and Jie Zhou.
Deep fitting degree scoring network for monocular 3D object
detection. CVPR, 2019. 2
[34] Lijie Liu, Chufan Wu, Jiwen Lu, Lingxi Xie, Jie Zhou, and
Qi Tian. Reinforced axial refinement network for monocular
3d object detection. ECCV, 2020. 2,6
[35] Zechen Liu, Zizhang Wu, and Roland Toth. Smoke: Single-
stage monocular 3d object detection via keypoint estimation.
CVPRW, 2020. 2,6
[36] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully
convolutional networks for semantic segmentation. CVPR,
2015. 2
[37] Chenyang Lu, Gijs Dubbelman, and Marinus Jacobus Ger-
ardus van de Molengraft. Monocular semantic occupancy
grid mapping with convolutional variational auto-encoders.
ICRA, 2019. 2
[38] Xinzhu Ma, Shinan Liu, Zhiyi Xia, Hongwen Zhang, Xingyu
Zeng, and Wanli Ouyang. Rethinking pseudo-lidar represen-
tation. ECCV, 2020. 1,3,6
[39] Xinzhu Ma, Zhihui Wang, Haojie Li, Wanli Ouyang, and
Pengbo Zhang. Accurate monocular 3D object detection via
color-embedded 3D reconstruction for autonomous driving.
ICCV, 2019. 1,3,6
[40] Fabian Manhardt, Wadim Kehl, and Adrien Gaidon. ROI-
10D: monocular lifting of 2d detection to 6d pose and metric
shape. CVPR, 2019. 3,6
[41] Kaustubh Mani, Swapnil Daga, Shubhika Garg, Sai Shankar
Narasimhan, Madhava Krishna, and Krishna Murthy Jataval-
labhula. Monolayout: Amodal scene layout from a single
image. WACV, 2020. 2
[42] Arsalan Mousavian, Dragomir Anguelov, John Flynn, and
Jana Kosecka. 3d bounding box estimation using deep learn-
ing and geometry. CVPR, 2016. 1,2
[43] Mong H. Ng, Kaahan Radia, Jianfei Chen, Dequan Wang,
Ionel Gog, and Joseph E. Gonzalez. Bev-seg: Bird’s eye
view semantic segmentation using geometry and semantic
point cloud. CVPRW, 2020. 2
[44] Bowen Pan, Jiankai Sun, Alex Andonian, Aude Oliva, and
Bolei Zhou. Cross-view semantic segmentation for sensing
surroundings. ICRA, 2019. 2
[45] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer,
James Bradbury, Gregory Chanan, Trevor Killeen, Zeming
Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison,
Andreas Kopf, Edward Yang, Zachary DeVito, Martin Rai-
son, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner,
Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An
imperative style, high-performance deep learning library. In
NeurIPS. Curran Associates, Inc., 2019. 6
[46] Jonah Philion and Sanja Fidler. Lift, splat, shoot: Encoding
images from arbitrary camera rigs by implicitly unprojecting
to 3d. ECCV, 2020. 1,2
[47] Alex D. Pon, Jason Ku, Chengyao Li, and Steven L. Waslan-
der. Object-centric stereo matching for 3d object detection.
ICRA, 2020. 1
[48] Rui Qian, Divyansh Garg, Yan Wang, Yurong You, Serge Be-
longie, Bharath Hariharan, Mark Campbell, Kilian Q. Wein-
berger, and Wei-Lun Chao. End-to-end pseudo-lidar for
image-based 3d object detection. CVPR, 2020. 1
[49] Thomas Roddick and Roberto Cipolla. Predicting semantic
map representations from images using pyramid occupancy
networks. CVPR, 2020. 2
[50] Thomas Roddick, Alex Kendall, and Roberto Cipolla. Ortho-
graphic feature transform for monocular 3D object detection.
BMVC, 2018. 1,3,6
[51] Samuel Schulter, Menghua Zhai, Nathan Jacobs, and Man-
mohan Chandraker. Learning to look around objects for top-
view representations of outdoor scenes. ECCV, 2018. 2
[52] Shaoshuai Shi, Chaoxu Guo, Li Jiang, Zhe Wang, Jianping
Shi, Xiaogang Wang, and Hongsheng Li. PV-RCNN: Point-
voxel feature set abstraction for 3D object detection. CVPR,
2020. 1
[53] Shaoshuai Shi, Xiaogang Wang, and Hongsheng Li. PointR-
CNN: 3D object proposal generation and detection from
point cloud. CVPR, 2019. 1
[54] Andrea Simonelli, Samuel Rota Bulò, Lorenzo Porzi, Manuel López-Antequera, and Peter Kontschieder. Disentangling monocular 3d object detection. ICCV, 2019. 2, 6
[55] Andrea Simonelli, Samuel Rota Bulò, Lorenzo Porzi, Elisa Ricci, and Peter Kontschieder. Towards generalization across depth for monocular 3d object detection. ECCV, 2020. 2, 6
[56] Leslie N. Smith. A disciplined approach to neural network
hyper-parameters: Part 1 - learning rate, batch size, momen-
tum, and weight decay. arXiv preprint, 2018. 6
[57] Siddharth Srivastava, Frédéric Jurie, and Gaurav Sharma.
Learning 2d to 3d lifting for object detection in 3d for au-
tonomous vehicles. arXiv preprint, 2019. 3
[58] Jiaming Sun, Linghao Chen, Yiming Xie, Siyu Zhang, Qin-
hong Jiang, Xiaowei Zhou, and Hujun Bao. Disp r-cnn:
Stereo 3d object detection via shape prior guided instance
disparity estimation. CVPR, 2020. 1
[59] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien
Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou,
Yuning Chai, Benjamin Caine, Vijay Vasudevan, Wei Han,
Jiquan Ngiam, Hang Zhao, Aleksei Timofeev, Scott Et-
tinger, Maxim Krivokon, Amy Gao, Aditya Joshi, Yu Zhang,
Jonathon Shlens, Zhifeng Chen, and Dragomir Anguelov.
Scalability in perception for autonomous driving: Waymo
open dataset, 2019. 2,6,7
[60] Yunlei Tang, Sebastian Dorn, and Chiragkumar Savani. Cen-
ter3d: Center-based monocular 3d object detection with joint
depth understanding. arXiv preprint, 2020. 5,8
[61] Lijun Wang, Jianming Zhang, Oliver Wang, Zhe Lin, and
Huchuan Lu. Sdc-depth: Semantic divide-and-conquer net-
work for monocular depth estimation. CVPR, 2020. 2
[62] Yan Wang, Wei-Lun Chao, Divyansh Garg, Bharath Hari-
haran, Mark Campbell, and Kilian Q. Weinberger. Pseudo-
LiDAR from visual depth estimation: Bridging the gap in 3D
object detection for autonomous driving. CVPR, 2019. 1,3
[63] Ziyan Wang, Buyu Liu, Samuel Schulter, and Manmohan
Chandraker. A parametric top-view representation of com-
plex road scenes. CVPR, 2019. 2
[64] Xinshuo Weng and Kris Kitani. Monocular 3d object detec-
tion with pseudo-lidar point cloud. ICCVW, 2019. 3,6
[65] Bin Xu and Zhenzhong Chen. Multi-level fusion based 3D
object detection from monocular images. CVPR, 2018. 3
[66] Dan Xu, Wanli Ouyang, Xiaogang Wang, and Nicu Sebe.
Pad-net: Multi-tasks guided prediction-and-distillation net-
work for simultaneous depth estimation and scene parsing.
CVPR, 2018. 2
[67] Xuepeng Shi, Zhixiang Chen, and Tae-Kyun Kim. Distance-normalized unified representation for monocular 3d object detection. ECCV, 2020. 3, 6