
3DIAS: 3D Shape Reconstruction with Implicit Algebraic Surfaces


Abstract

3D shape representation has a substantial effect on 3D shape reconstruction. Primitive-based representations approximate a 3D shape mainly by a set of simple implicit primitives, but the low geometrical complexity of the primitives limits the shape resolution. Moreover, setting a sufficient number of primitives for an arbitrary shape is challenging. To overcome these issues, we propose a constrained implicit algebraic surface as the primitive, with few learnable coefficients and higher geometrical complexity, and a deep neural network to produce these primitives. Our experiments demonstrate the superiority of our method in terms of representation power compared to the state-of-the-art methods in single RGB image 3D shape reconstruction. Furthermore, we show that our method can semantically learn segments of 3D shapes in an unsupervised manner. The code is publicly available from this link.
Mohsen Yavartanoo*, Jaeyoung Chung*, Reyhaneh Neshatavar, Kyoung Mu Lee
ASRI, Department of ECE, Seoul National University, Seoul, Korea
1. Introduction
Single image 3D reconstruction is a procedure of cap-
turing the structure and the surface of 3D shapes from sin-
gle RGB images, which has various applications in com-
puter vision, computer graphics, computer animation, and
augmented reality. Recent advanced methods have sub-
stantially improved 3D shape reconstruction with the ad-
vent of deep neural networks (DNNs). These methods can
be mainly categorized based on the representation of 3D
shapes into explicit-based [4,1,28] and implicit-based
[26,10,29,12,30] methods. Voxel-grid, as the most
straightforward explicit representation, is useful in many
applications. However, voxel-based methods generally suf-
fer from large memory usage and quantization artifacts [37].
Polygon meshes [21,41] have been introduced as an alternative
representation. However, since many polygon mesh-based
methods start from a template mesh and deform it to reconstruct
the target 3D shapes [41,16], they cannot produce
3D shapes with arbitrary topologies.
*equal contribution
Figure 1: The exploded view of 3DIAS representation. The
3D shapes consist of a union over the proposed constrained
implicit algebraic primitives with proper attributes.
On the other hand, implicit representations can approx-
imate surfaces of 3D shapes as zero-sets of continuous
functions in the Euclidean space. Recent implicit-based
methods have shown promise in reconstructing arbitrary
shapes without any template [29,26,40,12,30,10].
These methods can be categorized into two mainstreams:
isosurface-based and primitive-based methods. Isosurface-
based methods generally generate a surface by employing a
neural network [29,26] as an implicit function that assigns
negative and positive values or different probabilities to the
points lying inside and outside the shape. However, each
time a shape is visualized, these methods require all the neural
network parameters to extract the zero-set by determining the
sign of many sample points in 3D space. Furthermore, these
representations are unsuitable for computer graphics and
virtual reality applications because they require additional
postprocessing like marching cubes to generate the final 3D
Figure 2: Composition of implicit algebraic surfaces, with panels (a) Target, (b) Quartic polynomials, (c) Collection of primitives, and (d) After union. Our network approximates the target shape as a union over a set of
constrained implicit algebraic surfaces. The network estimates the coefficients and the centers of the polynomials. Note that since
the level sets for each primitive are quite different, the final surface has non-uniform level sets as shown in (d).
shapes. Contrastingly, primitive-based methods approxi-
mate 3D shapes by a group of primitives such as cubes [40],
ellipsoids [12], superquadrics [30], and convexes [10]. De-
spite their advantages in visualization and direct usage for
various applications, the resolutions of reconstructed shapes
are limited due to the simple topology (i.e., genus-zero) of
the primitives. Consequently, approximating a 3D shape re-
quires many of these simple primitives. Moreover, since
the geometrical complexity varies from shape to shape, de-
termining a sufficient number of primitives is challenging
for an arbitrary shape.
In this paper, we propose a novel primitive-based 3D
shape representation based on the learnable implicit alge-
braic surfaces named 3DIAS as shown in Figure 1. Since
implicit algebraic surfaces have high degrees of freedom,
they can describe complex shapes better than simple prim-
itives [2]. Besides, identifying an implicit algebraic primi-
tive is straightforward and depends on only a few parame-
ters. We apply various constraints on these primitives to fa-
cilitate learning and achieve detailed appearances. We limit
our primitives to the class of algebraically solvable implicit
algebraic surfaces to assist fast 2D rendering and 3D visu-
alization, which can be useful in many computer graphics
applications. Furthermore, we develop an upper bound con-
straint with an efficient parameterization to guarantee that
the primitives have closed surfaces and controlled sizes. Fi-
nally, we guide the primitives to cover different segments
of a target shape by restricting the locations of their cen-
ters. To generate these primitives, we design a DNN-based
encoder-decoder that captures the information of an obser-
vation (e.g., single image) and provides the parameters of
the primitives. In our experiments, we show that our method
outperforms state-of-the-art methods with most of the met-
rics. Moreover, we experimentally demonstrate that 3DIAS
can semantically learn the components of 3D shapes with-
out any supervision and adjust the number of primitives by
excluding the primitives with empty volumes.
We summarize our main contributions as follows:
• We propose a novel primitive-based 3D shape representation
with learnable implicit algebraic surfaces,
which can produce more complex topologies with few
parameters and is hence appropriate for describing
geometrically complex shapes.
• We develop various constraints to produce solvable
and closed primitives with proper scales in desired lo-
cations to ease learning and generate appealing results.
• We experimentally demonstrate that 3DIAS outper-
forms state-of-the-art methods. Furthermore, we show
that it can semantically learn the components of 3D
shapes and adjust the number of used primitives.
2. Related Work
In this section, we review some related DNN-based 3D
shape reconstruction methods with various representations.
Explicit representations. Voxel grids are commonly used
for discriminative [25,32] and generative [9,13] tasks since
they are the simplest 3D representation. However,
voxel-based results are limited in resolution
due to memory issues. Although [17,38] proposed to reconstruct
3D objects in a multi-scale fashion, they are still
limited to comparably small $256^3$ voxel grids and require
multiple forward passes to generate final 3D voxels. 3D
point clouds give an alternative representation of 3D shapes.
Qi et al. [31,33] pioneered point clouds as a discrimina-
tive representation for 3D tasks using deep learning. Fan et
al. [11] introduced point clouds in 3D reconstruction task
as an output representation. However, since point clouds
carry no connectivity information between the points, they
need additional post-processing procedures [3,5] to build
the 3D mesh as an output. Mesh is another commonly used
representation for 3D shape [14,20]. However, most of
[Figure 3 diagram. Recoverable labels: Single RGB; (4 FCs), (2 FCs), (4 FCs); Learnable Symmetric; Upper Bounds; Sigmoid; Tanh; 3D Implicit Algebraic Surface; Eqs. 2, 4, 5, 6, 11.]
Figure 3: Overview of single RGB 3D surface reconstruction using 3DIAS. We use an encoder (ResNet-18 [18]) to learn
the local and global information from the given single RGB image. Then three sets of fully connected layers decode the
latent features to provide the coefficients, the scales, and the locations of the centers for $M$ primitives. Note that we apply the min
operator to take the union over the $M$ primitives.
methods in 3D shape reconstruction using meshes generate
meshes with simple topology [41] or utilize a template
as a reference. They can only handle objects from the
same class [20,24] and cannot guarantee closed
surfaces [14].
Implicit Representations. Implicit representation is a good
alternative to avoid the problems above. In contrast to the
mesh-based approaches, implicit-based methods do not re-
quire a template from the same object class as input. There
are mainly two approaches; isosurface-based and primitive-
based methods. Chen et al. [8] propose a neural network
that takes the 3D points and latent code of the shape, then
outputs the values for each point, indicating whether the
point is outside or inside the shape as an occupancy func-
tion. Park et al. [29] utilize the signed distance function
to obtain the zero-set surface of the shape. Tatarchenko et
al. [39] proposed the occupancy function that implicitly de-
scribes the 3D surfaces as the continuous decision bound-
ary of a DNN-based classifier. Compared to voxel repre-
sentation, this method can estimate an occupancy function
continuous in 3D with a learnable neural network that can
generate at any arbitrary resolution. This approach signif-
icantly decreases memory usage during training. The sur-
face can be extracted as a mesh representation from the
learned model at test time by a multi-resolution isosurface
extraction method. The finite resolution of the voxel or oc-
tree cells limits the accuracy of the reconstructed shape by
these methods and their ability to capture fine details of 3D
shapes. Deng et al. [10] represented a shape with a con-
vex combination of half-planes. These methods use a sin-
gle global latent vector to represent entire surfaces of a 3D
shape. The latent vector is decoded into continuous surfaces
with the corresponding implicit networks. While this tech-
nique successfully models geometry, it often requires many
primitives to obtain a desirable appearance, and it is unclear
how many primitives are required.
For the 3D reconstruction task, we compare our implicit-
based approach against several state-of-the-art implicit-
based methods such as Structured Implicit Function (SIF)
[12], OccNet [26], and CvxNet [10]. Moreover, we select
Pixel2Mesh [41] and AtlasNet [15], which use explicit surface
generation in contrast to the previous methods.
3. 3DIAS
In this section, we first introduce our 3D shape representation
based on implicit algebraic surfaces. Then we explain
the additional constraints for effective learning. Next, we
describe the proposed network and learning procedure to
reconstruct the surface of 3D shapes with our representation.
3.1. 3D shape representation
3.1.1 Implicit algebraic primitives
We build a complex target 3D shape with a combination of
primitives $p(x, y, z)$ that are the building blocks of 3D (i.e.,
basic geometric forms) as shown in Figure 1. To select a
primitive with a large degree of freedom (i.e., complex geometry
and topology) and few parameters, we employ the
implicit algebraic surface that is the zero-level set of a multivariate
polynomial function of $x$, $y$, and $z$ as Eq. 1:

$$p(x, y, z) = \sum_{0 \le i+j+k \le d} a_{ijk}\, x^i y^j z^k = v A v^T = 0, \tag{1}$$

where $v = [1, x, y, z, x^2, y^2, z^2, \dots]$, and $d$, $a_{ijk}$, and $A$
are the degree, the coefficients, and the coefficient matrix of the
polynomial function, respectively. Like many other implicit
surfaces, the implicit algebraic surface divides the space and
maps points in 3D space into negative and positive values.
Therefore, to represent detailed surfaces $S(x, y, z)$ of 3D
shapes, we can combine these primitives by utilizing constructive
solid geometry [19] and apply boolean operations
to them, which can be formulated as Eq. 2:

$$S(x, y, z) = \bigcup_{m=1}^{M} p_m(x, y, z) = \min(p_1(x, y, z), \dots, p_M(x, y, z)), \tag{2}$$

where $M$ is the number of primitives in the union.
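As a rough illustration of Eqs. 1 and 2, the following NumPy sketch evaluates a primitive as $v A v^T$ and takes the union with a pointwise minimum. It assumes the ten-element monomial ordering $v = [1, x, y, z, x^2, y^2, z^2, xy, yz, zx]$, which is consistent with the $10 \times 10$ matrices used later but is not spelled out in the elided definition of $v$. The helper names (`monomials`, `primitive_values`, `union_values`) and the two example primitives are hypothetical.

```python
import numpy as np

def monomials(points):
    """Map (N, 3) points to (N, 10) rows [1, x, y, z, x^2, y^2, z^2, xy, yz, zx].

    The ordering is an assumption made for this illustration.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    return np.stack([np.ones_like(x), x, y, z,
                     x * x, y * y, z * z, x * y, y * z, z * x], axis=1)

def primitive_values(A, points):
    """Evaluate p = v A v^T (Eq. 1) at every point for one 10x10 matrix A."""
    v = monomials(points)
    return np.einsum("ni,ij,nj->n", v, A, v)

def union_values(matrices, points):
    """Evaluate S = min_m p_m (Eq. 2): negative inside the union, positive outside."""
    return np.min([primitive_values(A, points) for A in matrices], axis=0)

# Hypothetical primitive 1: x^4 + y^4 + z^4 - 1.
A1 = np.zeros((10, 10))
A1[0, 0] = -1.0
A1[4, 4] = A1[5, 5] = A1[6, 6] = 1.0

# Hypothetical primitive 2: x^2 + y^2 + z^2 - 2.25 (a sphere of radius 1.5).
A2 = np.zeros((10, 10))
A2[0, 0] = -2.25
for k in (4, 5, 6):
    A2[0, k] = A2[k, 0] = 0.5

pts = np.array([[1.2, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
S = union_values([A1, A2], pts)  # sign of S tells inside (<0) from outside (>0)
```

The first point lies outside primitive 1 but inside primitive 2, so the minimum marks it as inside the union.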
3.1.2 Constraints on primitives
We apply a set of constraints on the defined implicit algebraic
primitive $p(x, y, z)$ to better approximate the surface
$S(x, y, z)$ of a target 3D shape.
Solvability of primitives. Easy visualization and rendering
of a 3D shape representation are useful
in many computer graphics and virtual reality applications.
Implicit algebraic primitives with algebraic solutions
are well suited to ray tracing and hence
provide these properties. Accordingly, we use multivariate
quartic ($d = 4$) polynomial functions as the primitives
$p(x, y, z)$ because they have the highest degree of freedom
among all implicit algebraic primitives with closed-form
algebraic solutions [34].
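To make the ray-tracing argument concrete, the sketch below restricts a quartic primitive to a ray $o + t\,d$, which yields a univariate polynomial of degree at most four in $t$, and returns the first intersection. Note the substitution: it uses NumPy's numerical root finder rather than the closed-form quartic formula that the solvability argument refers to. The monomial ordering and the function names are assumptions for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial

def ray_polynomial(A, origin, direction):
    """Return p(o + t d) as a Polynomial in t for p = v A v^T.

    Assumes v = [1, x, y, z, x^2, y^2, z^2, xy, yz, zx].
    """
    x, y, z = (Polynomial([float(o), float(d)]) for o, d in zip(origin, direction))
    one = Polynomial([1.0])
    v = [one, x, y, z, x * x, y * y, z * z, x * y, y * z, z * x]
    p = Polynomial([0.0])
    for i in range(10):
        for j in range(10):
            if A[i, j] != 0.0:
                p = p + float(A[i, j]) * (v[i] * v[j])
    return p

def first_hit(A, origin, direction, eps=1e-9, imag_tol=1e-7):
    """Smallest positive real root t of p(o + t d) = 0, or None if the ray misses."""
    roots = ray_polynomial(A, origin, direction).roots()
    real = roots[np.abs(roots.imag) < imag_tol].real
    ahead = real[real > eps]
    return float(ahead.min()) if ahead.size else None

# Hypothetical primitive x^4 + y^4 + z^4 - 1.
A = np.zeros((10, 10))
A[0, 0] = -1.0
A[4, 4] = A[5, 5] = A[6, 6] = 1.0

t_hit = first_hit(A, origin=(-2.0, 0.0, 0.0), direction=(1.0, 0.0, 0.0))
t_miss = first_hit(A, origin=(-2.0, 2.0, 0.0), direction=(1.0, 0.0, 0.0))
```

Along the first ray the polynomial is $(t-2)^4 - 1$, whose real roots are $t = 1$ and $t = 3$, so the ray enters the primitive at $t = 1$; the second ray never reaches the surface.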
Closedness and scales of primitives. Beyond the aforementioned
constraint, we need to guarantee that the reconstructed
shape and hence all primitives have closed surfaces,
as in Figure 2. We can ensure that a quartic primitive has a
closed surface by enforcing its fourth-degree terms to always
be positive [22] as Eq. 3:

$$p_4(x, y, z) = \sum_{i+j+k=4} a_{ijk}\, x^i y^j z^k = u A_{[5:10]} u^T > 0, \tag{3}$$

where $u = [x^2, y^2, z^2, xy, yz, zx]$ and $A_{[5:10]}$ is the $6 \times 6$
sub-matrix of $A$, including the coefficients of fourth-degree
terms. This implies that $A_{[5:10]} \succ 0$ is a positive definite
(PD) matrix. Note that, with the PD matrix $A_{[5:10]} \succ 0$,
the algebraic surface exists if and only if $p_3(x, y, z) =
\sum_{0 \le i+j+k \le 3} a_{ijk}\, x^i y^j z^k$ is negative and $|p_3(x, y, z)| >
|p_4(x, y, z)|$ for some points in $\mathbb{R}^3$. Otherwise, the primitive
has zero volume because it has no real-valued solution.
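The closedness condition can be checked numerically by testing whether the fourth-degree block of $A$ is positive definite. The sketch below is only an illustration: it assumes that $A_{[5:10]}$ refers to rows and columns 5 through 10 in 1-based indexing (NumPy slice `4:10`), and the function names are hypothetical.

```python
import numpy as np

def quartic_block(A):
    """6x6 block of A acting on u = [x^2, y^2, z^2, xy, yz, zx] (assumed ordering)."""
    return A[4:10, 4:10]

def is_closed(A, tol=1e-12):
    """True if the fourth-degree block is positive definite (all eigenvalues > tol)."""
    block = quartic_block(A)
    eigenvalues = np.linalg.eigvalsh(0.5 * (block + block.T))
    return bool(eigenvalues.min() > tol)

# Closed example: identity quartic block, i.e. x^4+y^4+z^4+x^2y^2+y^2z^2+z^2x^2 - 1.
A_closed = np.zeros((10, 10))
A_closed[4:10, 4:10] = np.eye(6)
A_closed[0, 0] = -1.0

# Open example: the x^4 term has a negative coefficient, so p -> -inf along the x axis.
A_open = A_closed.copy()
A_open[4, 4] = -1.0

closed_flag = is_closed(A_closed)
open_flag = is_closed(A_open)
```

A positive definite block makes the quartic terms dominate far from the origin, so the zero set cannot extend to infinity.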
Moreover, since each primitive reconstructs a different
segment of a target 3D shape, we need to ensure that its
volume is smaller than the target shape. To prevent generating
large primitives and to control their scales, we develop an
upper bound for each primitive. For a primitive
$p(x, y, z)$ to be contained in the upper bound $q(x, y, z)$, it is
sufficient to satisfy the inequality $q(x, y, z) < p(x, y, z)$ in $\mathbb{R}^3$,
which is also equivalent to Eq. 4:

$$p(x, y, z) = h(x, y, z) + q(x, y, z), \tag{4}$$

where $h(x, y, z)$ is a positive-valued function. The function
$h(x, y, z)$ is always positive if and only if its matrix of
coefficients $H$ is positive definite. As a result, the coefficient
matrix $A$ of primitive $p(x, y, z)$ is the sum of
a positive definite matrix $H_{10 \times 10}$ and the coefficient matrix
$Q_{10 \times 10}$ of the upper bound $q(x, y, z)$. For simplicity we
consider the upper bound $q(x, y, z)$ as Eq. 5:

$$q(x, y, z) = x^4 + y^4 + z^4 - R, \tag{5}$$

where $R$ is a positive value that controls the size of the upper
bound. Note that the developed upper bound is a more general
constraint that also satisfies the criteria of the closedness
constraint. For proof, refer to the supplementary materials.
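The effect of the upper bound can be checked numerically. The sketch below builds $Q$ for Eq. 5 in the assumed monomial basis $v = [1, x, y, z, x^2, y^2, z^2, xy, yz, zx]$, draws a random positive definite $H$, and verifies on sample points that $p = h + q$ stays above $q$, so every point inside the primitive also lies inside the bound. The random $H$, the value $R = 0.5$, and all names are illustrative assumptions.

```python
import numpy as np

def monomials(points):
    """(N, 3) points -> (N, 10) rows [1, x, y, z, x^2, y^2, z^2, xy, yz, zx] (assumed)."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    return np.stack([np.ones_like(x), x, y, z,
                     x * x, y * y, z * z, x * y, y * z, z * x], axis=1)

def quadratic_form(A, points):
    """Evaluate v A v^T at every point."""
    v = monomials(points)
    return np.einsum("ni,ij,nj->n", v, A, v)

def upper_bound_matrix(R):
    """Coefficient matrix Q of q = x^4 + y^4 + z^4 - R (Eq. 5)."""
    Q = np.zeros((10, 10))
    Q[0, 0] = -R
    Q[4, 4] = Q[5, 5] = Q[6, 6] = 1.0
    return Q

rng = np.random.default_rng(0)
R = 0.5
B = 0.05 * rng.standard_normal((10, 10))
H = B @ B.T + 1e-4 * np.eye(10)          # positive definite by construction
Q = upper_bound_matrix(R)
A = H + Q                                 # Eq. 4 in matrix form

# The origin plus random samples from the padded bounding box.
points = np.vstack([np.zeros((1, 3)), rng.uniform(-1.1, 1.1, size=(5000, 3))])
p = quadratic_form(A, points)
q = quadratic_form(Q, points)
inside_p = points[p <= 0.0]               # points inside the primitive
```

Because $x^4 \le x^4 + y^4 + z^4 < R$ inside the bound, every coordinate of an interior point is smaller than $R^{1/4}$ in magnitude, which is how $R$ limits the primitive's size.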
Locations of primitives. We also encourage the primitives
to better cover different components of a 3D shape by applying
a constraint on their locations. Accordingly, we restrict
the locations of their centers $c = (c_1, c_2, c_3)$ to the areas
near the shape and reformulate the primitives as Eq. 6:

$$p(x, y, z) = \sum_{0 \le i+j+k \le 4} a_{ijk}\, (x - c_1)^i (y - c_2)^j (z - c_3)^k = 0. \tag{6}$$
Therefore, within these constraints, we can reconstruct
primitives with controlled scales and locations, which facil-
itates the reconstruction and provides more details.
3.2. 3D shape reconstruction
To reconstruct a 3D shape in the proposed representation
from an input observation $o \in \mathcal{X}$ (e.g., single image), we
design a DNN architecture that receives the input and outputs
the corresponding matrix $H$, center $c$, and parameter $R$
for each primitive, as shown in Figure 3.
3.2.1 Training losses
We apply various losses to reconstruct 3D shapes.
Loss sign. Since the target surface in 3D space divides the
inside and outside, we define a sign function $\mathrm{sign}(x, y, z) :
\mathbb{R}^3 \to \{0, -1, 1\}$ on sample points $P \subset \mathbb{R}^3$, where the values
$0$, $-1$, and $1$ correspond to the points on the target surface,
its inside, and its outside, respectively. Likewise, we
can classify the points on/inside/outside the reconstructed
implicit surfaces $S(x, y, z)$ and reduce a loss between their
predicted and ground-truth signs. We use mean square error
(MSE) as the loss function as Eq. 7:

$$L^{sign}_{\mathcal{B}} = \sum_{i \in \{on,\, in,\, out\}} \lambda_i \cdot \mathbb{E}_{p_i \in P} \left\| \tanh(S(p_i)) - \mathrm{sign}(p_i) \right\|^2, \tag{7}$$

where $\mathcal{B} \subset \mathcal{X}$ and $\lambda_i$ are a training batch and the weights
corresponding to each sign, respectively. Note that $L^{sign}$
enforces the network to reconstruct the desired surface and
to refrain from generating redundant surfaces simultaneously.
Moreover, the MSE loss forces further attention on distinguishing
the inside and outside points near the reconstructed
surface because their $\tanh S(p_i)$ are near zero while their
ground truths are $-1$ and $+1$, respectively.
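A minimal NumPy sketch of the sign loss follows, assuming the sum in Eq. 7 runs over the three point groups (on, inside, outside) with targets $0$, $-1$, and $+1$, and using the weights $\lambda_{on} = 2$, $\lambda_{in} = 1$, $\lambda_{out} = 10$ reported in the implementation details. The function name and the dictionary layout are hypothetical, and the sketch omits batching and automatic differentiation.

```python
import numpy as np

# Target sign for each point group: on the surface, inside, outside.
TARGET_SIGN = {"on": 0.0, "in": -1.0, "out": 1.0}

def sign_loss(S_by_group, weights):
    """Weighted MSE between tanh(S) and the target sign, summed over point groups.

    S_by_group maps "on"/"in"/"out" to arrays of S values at the sampled points.
    """
    total = 0.0
    for group, target in TARGET_SIGN.items():
        residual = np.tanh(np.asarray(S_by_group[group], dtype=float)) - target
        total += weights[group] * np.mean(residual ** 2)
    return float(total)

weights = {"on": 2.0, "in": 1.0, "out": 10.0}

# Confident, correct predictions: the loss is essentially zero.
good = {"on": [0.0, 0.0], "in": [-50.0], "out": [50.0]}
# Inside and outside swapped: each residual is 2, so the loss is 1*4 + 10*4 = 44.
bad = {"on": [0.0, 0.0], "in": [50.0], "out": [-50.0]}

loss_good = sign_loss(good, weights)
loss_bad = sign_loss(bad, weights)
```

The larger weight on outside points penalizes spurious surfaces in empty space more heavily than missing interior volume.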
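Loss normal. To improve the reconstruction, we use the
normal vectors as second-order information. Therefore, we
define an MSE loss between the ground-truth normal vectors $n_g$
of the sample points $p_{on}$ on the surface of a target mesh
model and the normal vectors $n_r$ of the reconstructed surface
at the same points, as Eq. 8:

$$L^{n}_{\mathcal{B}} = \mathbb{E}_{p_{on} \in P} \left\| n_r(p_{on}) - n_g(p_{on}) \right\|^2, \tag{8}$$

where the normal vectors on the union surface can be directly
determined for any point on the surface as Eq. 9:

$$n_r = \frac{\nabla S(x, y, z)}{\|\nabla S(x, y, z)\|} = \frac{\left( \frac{\partial S}{\partial x}, \frac{\partial S}{\partial y}, \frac{\partial S}{\partial z} \right)}{\sqrt{\left( \frac{\partial S}{\partial x} \right)^2 + \left( \frac{\partial S}{\partial y} \right)^2 + \left( \frac{\partial S}{\partial z} \right)^2}}, \tag{9}$$

$$\text{subject to: } \nabla S(x, y, z) = \nabla p_m(x, y, z), \quad m = \arg\min_{m} \left( p_m(x, y, z) \right).$$

Note that $m$ is the index of the closest primitive (i.e., the
primitive with the smallest value $p_m(x, y, z)$) to the point.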
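The normal of Eq. 9 can be computed analytically, because the gradient of $p = v A v^T$ is $J^T (A + A^T) v$, where $J$ is the Jacobian of the monomial vector with respect to $(x, y, z)$. The NumPy sketch below selects the arg-min primitive per point and normalizes its gradient. It assumes the monomial ordering $v = [1, x, y, z, x^2, y^2, z^2, xy, yz, zx]$, ignores primitive centers, and does not guard against zero gradients; all names are hypothetical.

```python
import numpy as np

def monomials(points):
    """(N, 3) points -> (N, 10) rows [1, x, y, z, x^2, y^2, z^2, xy, yz, zx] (assumed)."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    return np.stack([np.ones_like(x), x, y, z,
                     x * x, y * y, z * z, x * y, y * z, z * x], axis=1)

def monomial_jacobian(points):
    """(N, 10, 3) array holding d v_i / d(x, y, z) for each point."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    J = np.zeros((len(points), 10, 3))
    J[:, 1, 0] = J[:, 2, 1] = J[:, 3, 2] = 1.0
    J[:, 4, 0], J[:, 5, 1], J[:, 6, 2] = 2 * x, 2 * y, 2 * z
    J[:, 7, 0], J[:, 7, 1] = y, x      # xy
    J[:, 8, 1], J[:, 8, 2] = z, y      # yz
    J[:, 9, 0], J[:, 9, 2] = z, x      # zx
    return J

def union_normals(matrices, points):
    """Unit normals of S = min_m p_m, taken from the arg-min primitive at each point."""
    matrices = np.asarray(matrices, dtype=float)             # (M, 10, 10)
    v = monomials(points)
    J = monomial_jacobian(points)
    values = np.einsum("ni,mij,nj->mn", v, matrices, v)      # (M, N)
    selected = matrices[np.argmin(values, axis=0)]           # (N, 10, 10)
    sym = selected + np.swapaxes(selected, 1, 2)
    grad = np.einsum("nia,nij,nj->na", J, sym, v)            # (N, 3)
    return grad / np.linalg.norm(grad, axis=1, keepdims=True)

# Hypothetical primitives: unit sphere x^2+y^2+z^2-1 and quartic x^4+y^4+z^4-16.
sphere = np.zeros((10, 10))
sphere[0, 0] = -1.0
for k in (4, 5, 6):
    sphere[0, k] = sphere[k, 0] = 0.5
quartic = np.zeros((10, 10))
quartic[0, 0] = -16.0
quartic[4, 4] = quartic[5, 5] = quartic[6, 6] = 1.0

pts = np.array([[1.0, 0.0, 0.0], [0.0, 0.6, 0.8]])
n_sphere = union_normals([sphere], pts)
n_union = union_normals([sphere, quartic], pts)
```

With the sphere alone, the normals equal the points themselves; once the larger quartic is added, it has the smaller value at both points, so its gradient direction $(x^3, y^3, z^3)$ is used instead.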
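Finally, the total loss $L^{total}_{\mathcal{B}}$ is the weighted average of all
defined losses with the corresponding weights as Eq. 10:

$$L^{total}_{\mathcal{B}} = L^{sign}_{\mathcal{B}} + \lambda_n L^{n}_{\mathcal{B}}. \tag{10}$$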
3.2.2 Implementation details
We consider the bounding box $C = [-1, 1]^3$ and fit the
given input 3D shapes into it while keeping their aspect ratios.
Then we extract 1M points from $[-1.1, 1.1]^3 \subset \mathbb{R}^3$
surrounding the 3D shapes, and at each iteration, we randomly
select 1% of them as jointly inside points $p_{in}$ and outside
points $p_{out}$ for all shapes in the batch $\mathcal{B}$. Moreover, we
pick 10k points $p_{on}$ on the surface of each 3D shape and
randomly select 20% of them at each iteration.
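The preprocessing described above might look like the following NumPy sketch. It assumes uniform sampling inside the padded box and a normalization that centers the shape's axis-aligned bounding box and scales its longest side to $[-1, 1]$; the sampling distribution is not stated in the text, and the function names are hypothetical.

```python
import numpy as np

def fit_to_box(vertices):
    """Center the shape and scale it uniformly so its longest side spans [-1, 1]."""
    lo, hi = vertices.min(axis=0), vertices.max(axis=0)
    center = 0.5 * (lo + hi)
    half_extent = 0.5 * (hi - lo).max()   # one scale for all axes keeps the aspect ratio
    return (vertices - center) / half_extent

def sample_volume(n, half_extent=1.1, rng=None):
    """Draw n points uniformly from [-half_extent, half_extent]^3 (assumed distribution)."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.uniform(-half_extent, half_extent, size=(n, 3))

def random_subset(points, fraction, rng=None):
    """Pick a random fraction of the points without replacement for one iteration."""
    rng = np.random.default_rng() if rng is None else rng
    k = max(1, int(round(fraction * len(points))))
    return points[rng.choice(len(points), size=k, replace=False)]

rng = np.random.default_rng(0)
corners = np.array([[0.0, 0.0, 0.0], [4.0, 2.0, 1.0]])   # toy shape extents
fitted = fit_to_box(corners)

volume_points = sample_volume(10_000, rng=rng)            # the text uses 1M points
iteration_points = random_subset(volume_points, 0.01, rng=rng)
```

In the toy example the longest side (length 4) maps to $[-1, 1]$ while the shorter sides shrink proportionally.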
To capture the information of the input observation $o \in \mathcal{X}$,
we employ the pretrained ResNet-18 [18] as the encoder.
Then three sets of independent fully connected (FC)
layers, $(4096, 4096, 4096, 55 \times M)$, $(1024, 512, 256, 3 \times M)$,
and $(256, M)$, decode the encoded features to obtain the
parameters of each symmetric matrix $B$, center $c$, and scalar
$R$ for $M = 100$ primitives in Eq. 2, respectively. All FC
layers except the last layers are followed by the ReLU
non-linear activation function. We also apply three batch
normalization layers after the first three FC layers of the $B$
decoder to accelerate training and boost the performance.
To ensure $H$ is a PD matrix, we parameterize it as Eq. 11:

$$H = BB^T + \alpha I \succ 0, \tag{11}$$

where $\alpha = 0.0001$ is a small scalar factor, and $B$ and $I$
are $10 \times 10$ symmetric and identity matrices, respectively.
Moreover, we apply the sigmoid function on the output of
the $R$ decoder to generate a value in $(0, 1) \subset \mathbb{R}$ as the
parameter $R$ of Eq. 5 to control the size of all primitives
and guarantee that they are not larger than the
bounding box $C$. Furthermore, we apply tanh on the output
of the $c$ decoder to generate the centers inside the bounding
box $C$. Therefore, each primitive is parameterized with 59
parameters in total, including three parameters for the center
$c = (c_1, c_2, c_3)$, one parameter for $R$, and 55 parameters
for the matrix $B$.
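The parameter count and the positive-definiteness guarantee can be illustrated with a short NumPy sketch. It assumes that the 55 raw outputs fill the upper triangle of the symmetric $10 \times 10$ matrix $B$ (the text gives only the count, not the layout), and the function names are hypothetical.

```python
import numpy as np

def symmetric_from_params(params):
    """Build a symmetric 10x10 matrix from 55 values (upper triangle, assumed layout)."""
    B = np.zeros((10, 10))
    B[np.triu_indices(10)] = params
    return B + np.triu(B, k=1).T

def decode_primitive(raw_B, raw_R, raw_c, alpha=1e-4):
    """Turn raw decoder outputs into (H, R, c) for one primitive."""
    B = symmetric_from_params(raw_B)
    H = B @ B.T + alpha * np.eye(10)       # Eq. 11: positive definite for any B
    R = 1.0 / (1.0 + np.exp(-raw_R))       # sigmoid keeps R in (0, 1)
    c = np.tanh(raw_c)                     # tanh keeps the center inside (-1, 1)^3
    return H, R, c

rng = np.random.default_rng(0)
H, R, c = decode_primitive(rng.standard_normal(55), 0.0, rng.standard_normal(3))

n_params_B = len(np.triu_indices(10)[0])   # 10 * 11 / 2 = 55
n_params_total = n_params_B + 1 + 3        # B, R, and the center
```

Since $BB^T$ is positive semidefinite for any real $B$, adding $\alpha I$ bounds the smallest eigenvalue of $H$ below by $\alpha$.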
We set the parameters $\lambda_{on}$, $\lambda_{in}$, $\lambda_{out}$, and $\lambda_n$ to 2, 1, 10,
and 1, respectively. We train our encoder-decoder architecture
with the Adam optimizer with an initial learning rate of 1e-4,
weight decay of 1e-7, and batch size of 64. We implement our
model in Python 3.7 using PyTorch with CUDA.
4. Experiments
In this section, we provide information about the eval-
uation setups and show qualitative and quantitative results
of our method compared to state-of-the-art methods on sin-
gle RGB image 3D shape reconstruction. We also perform
various ablation studies to analyze our method better. More
experiments are available in the supplementary material.
4.1. Dataset and Metrics
We evaluate our approach on the subset of the ShapeNet
dataset [6] with the same image renderings and train-
ing/testing split provided by Choy et al. [9]. We also em-
ploy mesh-fusion [36] to generate watertight meshes from
the 3D CAD models. We then use Houdini [35] to extract
the inside/outside/on points and the normal vectors. For
evaluation, we use the volumetric IoU, Chamfer [11], and
F-Score [23] metrics. Volumetric IoU is used to measure
the overlapped volume between the ground truth meshes
and the reconstructed surfaces. Chamfer is the mean of the
accuracy and the completeness score. The mean distance of
points on the reconstructed surface to their nearest neigh-
bors on the ground truth mesh is defined as the accuracy
metric. The completeness metric is defined in the opposite
direction of the accuracy metric. F-score is the harmonic
mean of precision and recall, which shows the percentage of
correctly reconstructed surfaces. To compute IoU, we sample 100k
points from the bounding box. To evaluate Chamfer and F-score,
we first convert the reconstructed surfaces to meshes,
then, similar to CvxNet [10], we sample 100k points on the
reconstructed and the ground-truth meshes.
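The three metrics can be sketched in NumPy as follows. This is a simplified illustration under several assumptions: nearest neighbors are found by brute force (practical only for small point sets, not the 100k points used here), the F-score distance threshold `tau` is a free parameter whose value is not given in the text, and all function names are hypothetical.

```python
import numpy as np

def volumetric_iou(inside_pred, inside_gt):
    """IoU of two boolean occupancy arrays evaluated on the same sample points."""
    inside_pred = np.asarray(inside_pred, dtype=bool)
    inside_gt = np.asarray(inside_gt, dtype=bool)
    union = np.logical_or(inside_pred, inside_gt).sum()
    return float(np.logical_and(inside_pred, inside_gt).sum() / union) if union else 1.0

def nearest_distances(source, target):
    """Distance from each source point to its nearest target point (brute force)."""
    diff = source[:, None, :] - target[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1).min(axis=1))

def chamfer(pred, gt):
    """Mean of accuracy (pred -> gt) and completeness (gt -> pred)."""
    accuracy = nearest_distances(pred, gt).mean()
    completeness = nearest_distances(gt, pred).mean()
    return float(0.5 * (accuracy + completeness))

def f_score(pred, gt, tau):
    """Harmonic mean of precision and recall at distance threshold tau (assumed)."""
    precision = (nearest_distances(pred, gt) < tau).mean()
    recall = (nearest_distances(gt, pred) < tau).mean()
    total = precision + recall
    return float(2 * precision * recall / total) if total else 0.0

pred = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
gt = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.1]])

iou = volumetric_iou([True, True, False, False], [True, False, True, False])
cd = chamfer(pred, gt)
fs = f_score(pred, gt, tau=0.05)
```

In the toy example one of the two points matches exactly and the other is off by 0.1, so the accuracy and completeness terms are both 0.05, and precision and recall are both 0.5 at the chosen threshold.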
[Figure 4 image. Recoverable column labels for (a): Input RGB Image, GT Mesh, AtlasNet, OccNet, CvxNet, 3DIAS (ours), SIF; for (b): Input RGB Image, CvxNet, 3DIAS (ours).]
Figure 4: Qualitative comparison on single RGB image 3D shape reconstruction. SIF [12], AtlasNet [15], OccNet [26],
CvxNet [10], and our 3DIAS output a reconstructed 3D shape from the given RGB image. (a) Comparison with other methods
for the samples shown in CvxNet [10]. (b) More qualitative comparisons with CvxNet [10].
IoU
Category   P2M    AtlasNet  OccNet  SIF    CvxNet  3DIAS
airplane   0.420  -         0.571   0.530  0.598   0.549
bench      0.323  -         0.485   0.333  0.461   0.485
cabinet    0.664  -         0.733   0.648  0.709   0.730
car        0.552  -         0.737   0.657  0.675   0.737
chair      0.396  -         0.501   0.389  0.491   0.509
display    0.490  -         0.471   0.491  0.576   0.538
lamp       0.323  -         0.371   0.260  0.311   0.381
speaker    0.599  -         0.647   0.577  0.620   0.638
rifle      0.402  -         0.474   0.463  0.515   0.423
sofa       0.613  -         0.680   0.606  0.677   0.685
table      0.395  -         0.506   0.372  0.473   0.509
phone      0.661  -         0.720   0.658  0.719   0.751
vessel     0.397  -         0.530   0.502  0.552   0.538
mean       0.480  -         0.571   0.499  0.567   0.575

Chamfer
Category   P2M    AtlasNet  OccNet  SIF    CvxNet  3DIAS
airplane   0.187  0.104     0.147   0.167  0.093   0.087
bench      0.201  0.138     0.155   0.261  0.133   0.106
cabinet    0.196  0.175     0.167   0.233  0.160   0.123
car        0.180  0.141     0.159   0.161  0.103   0.091
chair      0.265  0.209     0.228   0.380  0.337   0.186
display    0.239  0.198     0.278   0.401  0.223   0.211
lamp       0.308  0.305     0.479   1.096  0.795   0.607
speaker    0.285  0.245     0.300   0.554  0.462   0.351
rifle      0.164  0.115     0.141   0.193  0.106   0.116
sofa       0.212  0.177     0.194   0.272  0.164   0.158
table      0.218  0.190     0.189   0.454  0.358   0.245
phone      0.149  0.128     0.140   0.159  0.083   0.080
vessel     0.212  0.151     0.218   0.208  0.173   0.206
mean       0.216  0.175     0.215   0.349  0.245   0.197

F-Score
Category   AtlasNet  OccNet  SIF    CvxNet  3DIAS
airplane   67.24     62.87   52.81  68.16   59.48
bench      54.50     56.91   37.31  54.64   60.17
cabinet    46.43     61.79   31.68  46.09   61.81
car        51.51     56.91   37.66  47.33   58.07
chair      38.89     42.41   26.90  38.49   43.14
display    42.79     38.96   27.22  40.69   42.40
lamp       33.04     38.35   20.59  31.41   37.52
speaker    35.75     42.48   22.42  29.45   39.16
rifle      64.22     56.52   53.20  63.74   47.44
sofa       43.46     48.62   30.94  42.11   49.73
table      44.93     58.49   30.78  48.10   57.63
phone      58.85     66.09   45.61  59.64   71.35
vessel     49.87     42.37   36.04  45.88   40.70
mean       48.57     51.75   34.86  47.36   52.22

Table 1: Evaluation of single image 3D shape reconstruction. We evaluate and compare our method (3DIAS) to the state-of-
the-art methods including P2M [41], AtlasNet [15], OccNet [26], SIF [12], and CvxNet [10] on a part of ShapeNet dataset [6]
in terms of IoU, Chamfer, and F-score.
4.2. Reconstruction
We evaluate 3DIAS trained on multiple classes, compare
it with state-of-the-art methods on single RGB image
3D shape reconstruction, and summarize the results in
Table 1. The experiments demonstrate
the superiority of 3DIAS over the explicit-based
methods P2M [41] and AtlasNet [15], the isosurface-based
method OccNet [26], and the recent primitive-based methods
SIF [12] and CvxNet [10] in terms of mean volumetric IoU
and F-score. We also achieve the second-best mean performance
with the Chamfer metric. We show more quantitative results
of 3DIAS for networks trained on a single class in
the supplementary material.
Moreover, we qualitatively evaluate 3DIAS trained on
single-class and compare it with the previous methods in
Figure 4. The results illustrate that 3DIAS achieves smooth
surfaces with desirable geometrical details. 3DIAS, unlike
the previous methods, is successful in reconstructing the 3D
shape with more complex topologies (e.g., chair) as shown
in Figure 4a. Moreover, compared to CvxNet [10], 3DIAS
can better reconstruct thin shapes (e.g., lamp) and shapes
whose counterparts are rare in the training dataset (e.g., airplane
and car); see Figure 4b.
4.3. Ablation study
We perform several ablation studies to analyze our proposed
representation and reconstruction procedure. First,
we show the ability of our method to generate more complex
primitives compared to other primitive-based methods.
Then we compare the required number of parameters to represent
3D shapes with our representation and other methods.
Finally, we demonstrate the power of our reconstruction
scheme to learn the semantic structures in an unsupervised
manner. Moreover, we evaluate the effects of the designed
constraints and the defined loss functions in reconstructing
3D shapes with highly detailed appearances.

Representation   SIF   OccNet   CvxNet   3DIAS
Num of params    700   11M      7700     480

Table 2: Number of parameters. The average number of parameters
for representing 3D shapes by different methods.
Figure 5: Statistics of the number of primitives. We com-
pute the average number of primitives selected among all
M= 100 primitives by the network for each category.
4.3.1 Complexity of primitives
We illustrate that our proposed constrained primitive is
able to form geometrically (e.g., curved) and topologically
(e.g., genus-one) complex shapes, as shown in Figure 6.
In contrast, the primitives of previous methods, such
as cubes [40], ellipsoids [12], superquadrics [30], and
convexes [10], cannot form such complex shapes.
4.3.2 The number of parameters
In Section 3.1 we argue that a primitive may have no real
solution when $|p_3(x, y, z)| < |p_4(x, y, z)|$ or $p_3(x, y, z)$
is non-negative for all points $(x, y, z) \in \mathbb{R}^3$ (i.e., no valid
surface). Accordingly, our method can ignore some of the
primitives among all $M = 100$ primitives by assigning
positive definite coefficient matrices $A$ to them and maintain
a sufficient number of primitives. Therefore, these
non-solvable primitives do not participate in reconstructing
the surface $S$. Note that our method selects sets of
primitives that are mainly different for inter-category shapes
and have a large overlap for intra-category shapes. During
the test phase, we can efficiently check the eigenvalues
of the coefficient matrix of each primitive and eliminate
the primitives whose eigenvalues are all non-negative. Our
experiments demonstrate that our network selects few primitives
to reconstruct 3D shapes, as shown in Figure 5. In addition,
since each quartic primitive can finally be identified
with only 35 coefficients $a_{ijk}$, the surface $S$ of 3D shapes
with the 3DIAS representation can be represented with only
$35 \times 13.71 \simeq 480$ parameters on average. 3DIAS
requires 68.571%, 0.004%, and 6.234% of the parameters
used in SIF [12], OccNet [26], and CvxNet [10] on average
to represent 3D shapes, respectively; see Table 2.
Figure 6: The complexity of our primitives, with panels (a) Lamp, (b) Speaker, and (c) Chair. The first and
the second rows show the reconstructed shapes and their
corresponding primitives, respectively. The proposed primitive
can effectively represent curved and torus shapes.
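The arithmetic above can be reproduced directly, together with a sketch of the eigenvalue-based pruning. The pruning function below drops a primitive when its coefficient matrix has no negative eigenvalue, since $v A v^T \ge 0$ everywhere in that case; having a negative eigenvalue is a necessary condition for a non-empty primitive, not a sufficient one. The function name and the monomial ordering behind the example matrix are assumptions.

```python
import numpy as np

# Number of monomials x^i y^j z^k with i + j + k <= 4.
coefficients_per_quartic = sum(
    1 for i in range(5) for j in range(5) for k in range(5) if i + j + k <= 4
)
average_primitives = 13.71
average_parameters = coefficients_per_quartic * average_primitives   # about 480

# Parameter counts of the other representations, taken from Table 2.
reference = {"SIF": 700, "OccNet": 11_000_000, "CvxNet": 7700}
percent_of = {name: 100.0 * 480 / count for name, count in reference.items()}

def active_primitives(matrices, tol=0.0):
    """Indices of primitives whose coefficient matrix has a negative eigenvalue."""
    keep = []
    for index, A in enumerate(matrices):
        eigenvalues = np.linalg.eigvalsh(0.5 * (A + A.T))
        if eigenvalues.min() < -tol:
            keep.append(index)
    return keep

# Hypothetical example: a positive definite matrix (empty primitive) and
# the quartic x^4 + y^4 + z^4 - 1 (non-empty primitive).
empty = np.eye(10)
ball = np.zeros((10, 10))
ball[0, 0] = -1.0
ball[4, 4] = ball[5, 5] = ball[6, 6] = 1.0

kept = active_primitives([empty, ball])
```

The count of 35 is the binomial coefficient $\binom{7}{3}$, the number of monomials of degree at most four in three variables.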
4.3.3 Unsupervised semantics segmentation
We also illustrate that our network learns a semantic structure
without any part-level supervision, such that one primitive
usually covers the same part of reconstructed 3D shapes
in the same class with the 3DIAS representation. We evaluate
the semantic structures on the PartNet [27] dataset, which
provides hierarchical part labels for ShapeNet. The
quantitative experiments in Figure 7 show that our method
achieves better average accuracy than CvxNet [10] and
average accuracy comparable to BAE [7]. In addition,
3DIAS achieves better accuracy than both methods for thin
parts (e.g., arm). Moreover, our qualitative experiments
illustrate that one primitive tends to cover the same semantic
part, as shown in Figure 8. This tendency is more pronounced
for the dominant primitive that covers more points.
For instance, the dominant primitives mainly cover the seat
of chairs because most of the chairs have seat parts. Please
see the supplementary material for more examples.
Part accuracy
Part   BAE [7]  CvxNet [10]  3DIAS
Back   86.36    91.50        88.87
Seat   73.66    90.63        70.29
Base   88.46    71.95        78.51
Arm    65.75    38.94        74.86
mean   78.56    73.25        78.13

Figure 7: Evaluation of semantic segmentation. (left) The
distribution of PartNet labels within 4 primitives in the chair
class. (right) The classification accuracy for each part. We
follow the evaluation method introduced in CvxNet [10].
Constraints           IoU    Chamfer  F-Score
-center               0.549  0.387    46.72
-scale                0.559  0.261    48.45
-scale, -closedness   0.546  0.280    44.44
All                   0.575  0.197    52.22

Table 3: Ablation study on constraints. We compare the effects
of the center, the scale, and the closedness constraints
in terms of IoU, Chamfer, and F-score. Note that in each
configuration we ignore one or two constraints.
4.3.4 Effects of constraints
We study the effect of our designed constraints on reconstructing
3D shapes. In each experiment, we evaluate our
baseline by ignoring one or more constraints. The quantitative
results based on volumetric IoU, Chamfer, and F-Score
show the importance of each constraint; see Table 3. We believe
these constraints encourage the network to reconstruct
a detailed 3D shape, especially the center constraint.
4.3.5 Effects of losses
While $L_{sign}$ for the inside/outside points tries to distinguish
the inside and outside of 3D shapes, it is not enough to achieve
a detailed surface due to the lack of sample points near the
surface. Therefore, points on the surface and their normal
vectors can facilitate the reconstruction. Note that normal
vectors carry important information on 3D geometry, such
as the local orientation of surfaces. Accordingly, we use the
$L_{sign}$ loss and the $L_n$ loss for the points on the surface to
better approximate the surfaces. We evaluate the effects of
$L_{sign}$, $L_n$, and their combination by excluding them for the
points on the surface and summarize the results in Table 4.
Note that for all the experiments in Table 4 we do not exclude
$L_{sign}$ for the inside/outside points. The results indicate the
importance of points on the surface to achieve more detailed
3D shapes.
(a) Airplane (b) Chair
Figure 8: Qualitative results on unsupervised semantic segmentation.
We visualize the results of 3DIAS for some samples
in the categories of (a) airplane and (b) chair.
Losses                  IoU     Chamfer   F-Score
-$L_n$                  0.568   0.210     49.17
-$L_{sign}$             0.548   0.219     43.77
-$L_{sign}$, -$L_n$     0.542   0.232     42.37
All                     0.575   0.197     52.22
Table 4: Ablation study on losses. We compare the effects
of the $L_{sign}$ and $L_n$ losses for the points on the surface in
terms of IoU, Chamfer, and F-score. Note that in each configuration
we ignore one or two loss functions. Moreover, we do
not exclude the $L_{sign}$ for the inside/outside points. Please
see the supplementary material for more examples.
5. Conclusion
In this paper, we propose a primitive-based representa-
tion and a learning scheme in which the primitives are learn-
able implicit algebraic surfaces that can jointly approximate
3D shapes. We design various constraints and loss functions
to achieve high-quality and detailed 3D shapes. We exper-
imentally demonstrate that our method outperforms state-
of-the-art methods in most of the metrics. Moreover, we il-
lustrate that our method can learn semantic meanings with-
out part-level supervision by automatically selecting sets of
primitives parametrized by only a few parameters. In the
future, we will exploit the solvability of the designed
primitives to develop a soft renderer, which would enable
reconstructing 3D shapes with self-supervised learning.
This work was supported in part by an IITP grant funded
by the Korean government [No. 2021-0-01343, Artificial
Intelligence Graduate School Program (Seoul National Uni-
References
[1] Panos Achlioptas, Olga Diamanti, Ioannis Mitliagkas, and
Leonidas J. Guibas. Representation learning and adversarial
generation of 3d point clouds. CoRR, 2017.
[2] C. Bajaj. The emergence of algebraic curves and surfaces in
geometric design. 1992.
[3] Fausto Bernardini, J. Mittleman, Holly Rushmeier, Cláudio
Silva, and Gabriel Taubin. The ball-pivoting algorithm for
surface reconstruction. Visualization and Computer Graph-
ics, IEEE Transactions on, 5:349 – 359, 11 1999.
[4] Andrew Brock, Theodore Lim, James M Ritchie, and
Nick Weston. Generative and discriminative voxel mod-
eling with convolutional neural networks. arXiv preprint
arXiv:1608.04236, 2016.
[5] F. Calakli and Gabriel Taubin. Ssd: Smooth signed distance
surface reconstruction. Computer Graphics Forum, 30:1993
– 2002, 11 2011.
[6] Angel X. Chang, Thomas A. Funkhouser, Leonidas J.
Guibas, Pat Hanrahan, Qi-Xing Huang, Zimo Li, Silvio
Savarese, Manolis Savva, Shuran Song, Hao Su, Jianxiong
Xiao, Li Yi, and Fisher Yu. Shapenet: An information-rich
3d model repository. CoRR, 2015.
[7] Zhiqin Chen, Kangxue Yin, Matthew Fisher, Siddhartha
Chaudhuri, and Hao Zhang. Bae-net: Branched autoen-
coder for shape co-segmentation. Proceedings of Interna-
tional Conference on Computer Vision (ICCV), 2019.
[8] Zhiqin Chen and Hao Zhang. Learning implicit fields for
generative shape modeling. In Proceedings of the IEEE Con-
ference on Computer Vision and Pattern Recognition, pages
5939–5948, 2019.
[9] Christopher B. Choy, Danfei Xu, JunYoung Gwak, Kevin
Chen, and Silvio Savarese. 3d-r2n2: A unified approach for
single and multi-view 3d object reconstruction. CoRR, 2016.
[10] Boyang Deng, Kyle Genova, Soroosh Yazdani, Sofien
Bouaziz, Geoffrey Hinton, and Andrea Tagliasacchi. Cvxnet:
Learnable convex decomposition. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pages 31–44, 2020.
[11] Haoqiang Fan, Hao Su, and Leonidas J. Guibas. A point
set generation network for 3d object reconstruction from a
single image. CoRR, 2016.
[12] Kyle Genova, Forrester Cole, Daniel Vlasic, Aaron Sarna,
William T. Freeman, and Thomas A. Funkhouser. Learn-
ing shape templates with structured implicit functions. 2019
IEEE/CVF International Conference on Computer Vision
(ICCV), 2019.
[13] Rohit Girdhar, David F. Fouhey, Mikel Rodriguez, and Ab-
hinav Gupta. Learning a predictable and generative vector
representation for objects. CoRR, 2016.
[14] Thibault Groueix, Matthew Fisher, Vladimir G. Kim, Bryan
Russell, and Mathieu Aubry. AtlasNet: A Papier-Mâché
Approach to Learning 3D Surface Generation. In Proceedings
IEEE Conf. on Computer Vision and Pattern Recognition
(CVPR), 2018.
[15] Thibault Groueix, Matthew Fisher, Vladimir G. Kim,
Bryan C. Russell, and Mathieu Aubry. Atlasnet: A papier-mâché
approach to learning 3d surface generation. CoRR,
[16] X. Han, Hamid Laga, and M. Bennamoun. Image-based 3d
object reconstruction: State-of-the-art and trends in the deep
learning era. IEEE transactions on pattern analysis and ma-
chine intelligence, 2019.
[17] Christian Häne, Shubham Tulsiani, and Jitendra Malik.
Hierarchical surface prediction for 3d object reconstruction.
CoRR, 2017.
[18] Kaiming He, X. Zhang, Shaoqing Ren, and Jian Sun. Deep
residual learning for image recognition. 2016 IEEE Confer-
ence on Computer Vision and Pattern Recognition (CVPR),
[19] John F. Hughes, Andries van Dam, Morgan McGuire,
David F. Sklar, James D. Foley, Steven K. Feiner, and Kurt
Akeley. Computer Graphics - Principles and Practice, 3rd
Edition. Addison-Wesley, 2014.
[20] Angjoo Kanazawa, Michael J. Black, David W. Jacobs, and
Jitendra Malik. End-to-end recovery of human shape and
pose. CoRR, 2017.
[21] Angjoo Kanazawa, Shubham Tulsiani, Alexei A. Efros, and
Jitendra Malik. Learning category-specific mesh reconstruc-
tion from image collections. CoRR, 2018.
[22] D. Keren, D. Cooper, and J. Subrahmonia. Describing com-
plicated objects by implicit polynomials. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 16(1):38–53,
[23] Arno Knapitsch, Jaesik Park, Qian-Yi Zhou, and Vladlen
Koltun. Tanks and temples: Benchmarking large-scale scene
reconstruction. ACM Trans. Graph., 36(4), July 2017.
[24] C. Kong, C. Lin, and S. Lucey. Using locally corresponding
cad models for dense 3d reconstructions from a single image.
In 2017 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), pages 5603–5611, 2017.
[25] D. Maturana and S. Scherer. Voxnet: A 3d convolutional
neural network for real-time object recognition. In 2015
IEEE/RSJ International Conference on Intelligent Robots
and Systems (IROS), pages 922–928, 2015.
[26] Lars M. Mescheder, Michael Oechsle, Michael Niemeyer,
Sebastian Nowozin, and Andreas Geiger. Occupancy net-
works: Learning 3d reconstruction in function space. CoRR,
[27] Kaichun Mo, Shilin Zhu, Angel X. Chang, L. Yi, Subarna
Tripathi, L. Guibas, and H. Su. Partnet: A large-scale bench-
mark for fine-grained and hierarchical part-level 3d object
understanding. 2019 IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR), 2019.
[28] Federico Monti, Davide Boscaini, Jonathan Masci,
Emanuele Rodolà, Jan Svoboda, and Michael M. Bronstein.
Geometric deep learning on graphs and manifolds using
mixture model cnns. CoRR, 2016.
[29] Jeong Joon Park, Peter Florence, Julian Straub, Richard
Newcombe, and Steven Lovegrove. Deepsdf: Learning con-
tinuous signed distance functions for shape representation.
In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, 2019.
[30] Despoina Paschalidou, Ali O. Ulusoy, and Andreas Geiger.
Superquadrics revisited: Learning 3d shape parsing beyond
cuboids. 2019 IEEE/CVF Conference on Computer Vision
and Pattern Recognition (CVPR), pages 10336–10345, 2019.
[31] Charles Ruizhongtai Qi, Hao Su, Kaichun Mo, and
Leonidas J. Guibas. Pointnet: Deep learning on point sets
for 3d classification and segmentation. CoRR, 2016.
[32] Charles Ruizhongtai Qi, Hao Su, Matthias Nießner, Angela
Dai, Mengyuan Yan, and Leonidas J. Guibas. Volumetric and
multi-view cnns for object classification on 3d data. CoRR,
[33] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J.
Guibas. Pointnet++: Deep hierarchical feature learning on
point sets in a metric space. CoRR, 2017.
[34] Michael I. Rosen. Niels hendrik abel and equations of
the fifth degree. The American Mathematical Monthly,
102(6):495–505, 1995.
[35] SideFX. Houdini, 2020.
[36] David Stutz and Andreas Geiger. Learning 3d shape comple-
tion under weak supervision. CoRR, 2018.
[37] Maxim Tatarchenko, Alexey Dosovitskiy, and Thomas Brox.
Octree generating networks: Efficient convolutional archi-
tectures for high-resolution 3d outputs. In The IEEE Inter-
national Conference on Computer Vision (ICCV), 2017.
[38] Maxim Tatarchenko, Alexey Dosovitskiy, and Thomas Brox.
Octree generating networks: Efficient convolutional archi-
tectures for high-resolution 3d outputs. pages 2107–2115,
10 2017.
[39] Maxim Tatarchenko, Stephan R. Richter, Rene Ranftl,
Zhuwen Li, Vladlen Koltun, and Thomas Brox. What do
single-view 3d reconstruction networks learn? In The IEEE
Conference on Computer Vision and Pattern Recognition
(CVPR), June 2019.
[40] Shubham Tulsiani, H. Su, L. Guibas, Alexei A. Efros, and
Jitendra Malik. Learning shape abstractions by assembling
volumetric primitives. 2017 IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), 2017.
[41] Nanyang Wang, Yinda Zhang, Zhuwen Li, Yanwei Fu, Wei
Liu, and Yu-Gang Jiang. Pixel2mesh: Generating 3d mesh
models from single RGB images. CoRR, 2018.
Supplementary Material for
3DIAS: 3D Shape Reconstruction with Implicit Algebraic Surfaces
Mohsen Yavartanoo* Jaeyoung Chung* Reyhaneh Neshatavar Kyoung Mu Lee
ASRI, Department of ECE, Seoul National University, Seoul, Korea
Category   IoU             Chamfer         F-Score         #primitive
           Multi  Single   Multi  Single   Multi  Single   Multi  Single
airplane   0.549  0.621    0.087  0.580    59.48  69.51    6.915  23.42
bench      0.485  0.462    0.106  0.677    60.17  59.60    16.97  22.24
cabinet    0.730  0.726    0.123  0.141    61.81  53.24    20.91  22.98
car        0.737  0.747    0.091  0.082    58.07  61.08    10.02  37.96
chair      0.509  0.493    0.186  0.304    43.14  38.85    17.17  19.15
display    0.538  0.511    0.211  1.137    42.40  37.85    13.88  14.95
lamp       0.381  0.352    0.607  1.494    37.52  35.96    9.946  17.54
speaker    0.638  0.632    0.351  0.310    39.16  34.19    19.24  19.65
rifle      0.423  0.509    0.116  0.760    47.44  61.60    4.719  17.16
sofa       0.685  0.667    0.158  0.186    49.73  45.20    21.35  24.55
table      0.509  0.478    0.245  0.369    57.63  50.96    15.97  16.68
phone      0.751  0.734    0.080  0.168    71.35  65.85    14.11  18.23
vessel     0.538  0.550    0.206  0.200    40.70  43.89    7.067  19.41
mean       0.575  0.575    0.197  0.493    52.22  50.60    13.71  21.07
Table S1: Comparison of multi-class and single-class single
RGB image 3D shape reconstruction in terms of IoU,
Chamfer, F-score, and the number of primitives.
S. 3DIAS analysis
In this section, we provide further experimental analyses
of our proposed 3DIAS method. First, we prove that our
scale constraint satisfies the criteria of the closedness
constraint. Then, we quantitatively and qualitatively compare
3DIAS trained on a single class with 3DIAS trained on
multiple classes. Finally, we show the effect of viewpoint on
reconstructing 3D shapes with our proposed method.
S.1. Holding closedness constraint
In Section 3.1.2, we claim that the parametrized coefficient
matrix $A$ satisfies the criteria for the closedness constraint.
To ensure that the closedness constraint is satisfied,
the coefficient matrix $A_{[5:10]}$ in Eq. 3 of the fourth-degree
terms $p_4(x, y, z)$ must be positive definite. The matrix
$A_{[5:10]}$, as a principal sub-matrix of the coefficient
matrix $A$ of $p(x, y, z)$, is the sum of the corresponding
sub-matrices $H_{[5:10]}$ and $Q_{[5:10]}$. Since the matrix $H$ is
assumed positive definite, its principal sub-matrices (e.g.,
$H_{[5:10]}$) are also positive definite. Moreover, the
corresponding sub-matrix $Q_{[5:10]}$ of the coefficient matrix $Q$ is
a diagonal matrix with the values $[1, 1, 1, 0, 0, 0]$; hence it is
positive semi-definite. Accordingly, for any nonzero vector $v$,
$v^\top A_{[5:10]} v = v^\top H_{[5:10]} v + v^\top Q_{[5:10]} v > 0$,
so the sub-matrix $A_{[5:10]}$, as the sum of a positive definite
matrix and a positive semi-definite matrix, is also positive
definite, which satisfies the closedness constraint.
*equal contribution
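The argument above can be checked numerically. The following is a minimal NumPy sketch, assuming a 10x10 symmetric coefficient matrix, 1-based inclusive indexing for `[5:10]` (Python slice `4:10`), a random positive definite `H` built as `B @ B.T` plus a small multiple of the identity, and a `Q` that is zero outside the sub-matrix described in the text. The matrix size, the construction of `H`, and the zero entries of `Q` are illustrative assumptions, not the paper's parametrization.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10  # assumed size of the coefficient matrix

# Random positive definite H (illustrative construction).
B = rng.standard_normal((n, n))
H = B @ B.T + 1e-3 * np.eye(n)

# Q is assumed zero except for the sub-matrix diag(1, 1, 1, 0, 0, 0) from the text.
Q = np.zeros((n, n))
Q[4:10, 4:10] = np.diag([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])

A = H + Q
sub = slice(4, 10)  # 1-based inclusive [5:10] as a Python slice

def min_eig(M):
    """Smallest eigenvalue of a symmetric matrix."""
    return np.linalg.eigvalsh(M).min()

h_min = min_eig(H[sub, sub])  # > 0: principal sub-matrix of a positive definite matrix
q_min = min_eig(Q[sub, sub])  # = 0: positive semi-definite
a_min = min_eig(A[sub, sub])  # >= h_min > 0: positive definite
```

The smallest eigenvalue of the sum is bounded below by the sum of the smallest eigenvalues of the two terms, which is why adding the semi-definite `Q` sub-matrix cannot break positive definiteness.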
S.2. Multi-class vs single-class training
We also train our network individually on each class and
compare the results in terms of IoU, Chamfer, and F-score
with the multi-class case; the results are summarized in
Table S1. Surprisingly, the comparison demonstrates that our
method trained on multiple classes reconstructs better on
average (lower Chamfer and higher F-score at equal mean IoU),
presumably because the single-class networks overfit to their
training sets. However, our method trained on a single class
selects more primitives on average to reconstruct 3D shapes,
as shown in Table S1. We believe this is because more
primitives are available per class to represent 3D shapes in
single-class training, while the network must distribute the
limited primitives among all categories in multi-class training.
We also visualize the 3D shapes reconstructed by 3DIAS
for both the single-class and the multi-class cases in
Figure S2. The results show that the primitives share the same
semantic meaning among the 3D shapes of the same
category in both the single-class and multi-class cases. Based
on our qualitative and quantitative experiments, we believe
that Chamfer is not a suitable or reliable metric for the
3D shape reconstruction task. For instance, Chamfer indicates
better performance for multi-class training on the airplane and
rifle categories, while the qualitative results show better
appearances for single-class training.
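The disagreement between metrics mentioned above can be read directly from Table S1. The following small sketch copies the per-category IoU, Chamfer, and F-score values from the table and lists the categories where Chamfer favours multi-class training while both IoU and F-score favour single-class training. The data layout and the function name `metric_disagreements` are illustrative.

```python
# Per category: (IoU, Chamfer, F-score) for multi-class, then for single-class (Table S1).
table_s1 = {
    "airplane": ((0.549, 0.087, 59.48), (0.621, 0.580, 69.51)),
    "bench":    ((0.485, 0.106, 60.17), (0.462, 0.677, 59.60)),
    "cabinet":  ((0.730, 0.123, 61.81), (0.726, 0.141, 53.24)),
    "car":      ((0.737, 0.091, 58.07), (0.747, 0.082, 61.08)),
    "chair":    ((0.509, 0.186, 43.14), (0.493, 0.304, 38.85)),
    "display":  ((0.538, 0.211, 42.40), (0.511, 1.137, 37.85)),
    "lamp":     ((0.381, 0.607, 37.52), (0.352, 1.494, 35.96)),
    "speaker":  ((0.638, 0.351, 39.16), (0.632, 0.310, 34.19)),
    "rifle":    ((0.423, 0.116, 47.44), (0.509, 0.760, 61.60)),
    "sofa":     ((0.685, 0.158, 49.73), (0.667, 0.186, 45.20)),
    "table":    ((0.509, 0.245, 57.63), (0.478, 0.369, 50.96)),
    "phone":    ((0.751, 0.080, 71.35), (0.734, 0.168, 65.85)),
    "vessel":   ((0.538, 0.206, 40.70), (0.550, 0.200, 43.89)),
}

def metric_disagreements(table):
    """Categories where Chamfer prefers multi-class but IoU and F-score prefer single-class."""
    found = []
    for name, (multi, single) in table.items():
        chamfer_prefers_multi = multi[1] < single[1]  # lower Chamfer is better
        others_prefer_single = multi[0] < single[0] and multi[2] < single[2]
        if chamfer_prefers_multi and others_prefer_single:
            found.append(name)
    return found

print(metric_disagreements(table_s1))  # prints ['airplane', 'rifle']
```

These are the two categories named in the text, so on this table the conflict between Chamfer and the other two metrics is confined to airplane and rifle.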
Figure S2: Qualitative comparison of single-class and multi-class cases. We visualize the results for some samples in each
category. The first, second, and third lines for each category show the input RGB images, single-class training results, and
multi-class training results, respectively.
S.3. Effect of viewpoint
We also show the effect of viewpoint in Figure S3. The
results illustrate that the quality of the reconstructed 3D
shapes is highly dependent on the point of view when sim-
ilar shapes are rare in the training dataset (e.g., bottom air-
plane). However, we show that this effect decreases when
there are enough similar shapes to the query shape (e.g., top
airplane) in the training dataset.
RGB Image Aligned Top-view Front-view Side-view
Figure S3: Effect of viewpoint in 3D shape reconstruction. We visualize the reconstructed 3D shape by 3DIAS for two
samples. (top) and (bottom) show two airplanes with small and large viewpoint effects on their appearances, respectively.