
3DIAS: 3D Shape Reconstruction with Implicit Algebraic Surfaces

Mohsen Yavartanoo*   Jaeyoung Chung*   Reyhaneh Neshatavar   Kyoung Mu Lee

ASRI, Department of ECE, Seoul National University, Seoul, Korea

{myavartanoo,robot0321,reyhanehneshat,kyoungmu}@snu.ac.kr

Abstract

3D shape representation has substantial effects on 3D shape reconstruction. Primitive-based representations approximate a 3D shape mainly by a set of simple implicit primitives, but the low geometrical complexity of the primitives limits the shape resolution. Moreover, setting a sufficient number of primitives for an arbitrary shape is challenging. To overcome these issues, we propose a constrained implicit algebraic surface as the primitive, with few learnable coefficients and higher geometrical complexity, and a deep neural network to produce these primitives. Our experiments demonstrate the superiority of our method in terms of representation power compared to the state-of-the-art methods in single RGB image 3D shape reconstruction. Furthermore, we show that our method can semantically learn segments of 3D shapes in an unsupervised manner. The code is publicly available from this link.

1. Introduction

Single image 3D reconstruction is the procedure of capturing the structure and the surface of 3D shapes from single RGB images, which has various applications in computer vision, computer graphics, computer animation, and augmented reality. Recent advanced methods have substantially improved 3D shape reconstruction with the advent of deep neural networks (DNNs). These methods can be mainly categorized, based on the representation of 3D shapes, into explicit-based [4, 1, 28] and implicit-based [26, 10, 29, 12, 30] methods. Voxel grids, as the most straightforward explicit representation, are useful in many applications. However, voxel-based methods generally suffer from large memory usage and quantization artifacts [37]. Polygon meshes [21, 41] have been introduced as an alternative representation. However, since many polygon mesh-based methods start from a template mesh and deform it to reconstruct the target 3D shapes [41, 16], they cannot produce 3D shapes with arbitrary topologies.

*equal contribution

Figure 1: The exploded view of the 3DIAS representation. The 3D shapes consist of a union over the proposed constrained implicit algebraic primitives with proper attributes.

On the other hand, implicit representations can approximate surfaces of 3D shapes as zero-sets of continuous functions in Euclidean space. Recent implicit-based methods [29, 26, 40, 12, 30, 10] have shown some promise in reconstructing arbitrary shapes without any template. These methods can be categorized into two mainstreams: isosurface-based and primitive-based methods. Isosurface-based methods generally generate a surface by employing a neural network [29, 26] as an implicit function that assigns negative and positive values, or different probabilities, to the points lying inside and outside the shape. However, at every visualization, these methods require all the neural network parameters to extract the zero-sets by determining the sign of many sample points in 3D space. Furthermore, these representations are unsuitable for computer graphics and virtual reality applications because they require additional postprocessing like marching cubes to generate the final 3D

Figure 2: Composition of implicit algebraic surfaces: (a) target, (b) quartic polynomials, (c) collection of primitives, (d) after union. Our network approximates the target shape as a union over a set of constrained implicit algebraic surfaces. The network estimates the coefficients and the centers of the polynomials. Note that since the level sets of each primitive are quite different, the final surface has non-uniform level sets as shown in (d).

shapes. Contrastingly, primitive-based methods approximate 3D shapes by a group of primitives such as cubes [40], ellipsoids [12], superquadrics [30], and convexes [10]. Despite their advantages in visualization and direct usage for various applications, the resolutions of reconstructed shapes are limited due to the simple topology (i.e., genus-zero) of the primitives. Consequently, approximating a 3D shape requires many of these simple primitives. Moreover, since the geometrical complexity varies from shape to shape, determining a sufficient number of primitives for an arbitrary shape is challenging.

In this paper, we propose a novel primitive-based 3D shape representation based on learnable implicit algebraic surfaces, named 3DIAS, as shown in Figure 1. Since implicit algebraic surfaces have high degrees of freedom, they can describe complex shapes better than simple primitives [2]. Besides, identifying an implicit algebraic primitive is straightforward and depends on only a few parameters. We apply various constraints on these primitives to facilitate learning and achieve detailed appearances. We limit our primitives to the class of algebraically solvable implicit algebraic surfaces to assist fast 2D rendering and 3D visualization, which can be useful in many computer graphics applications. Furthermore, we develop an upper bound constraint with an efficient parameterization to guarantee that the primitives have closed surfaces and controlled sizes. Finally, we guide the primitives to cover different segments of a target shape by restricting the locations of their centers. To generate these primitives, we design a DNN-based encoder-decoder that captures the information of an observation (e.g., a single image) and provides the parameters of the primitives. In our experiments, we show that our method outperforms state-of-the-art methods on most of the metrics. Moreover, we experimentally demonstrate that 3DIAS can semantically learn the components of 3D shapes without any supervision and adjust the number of primitives by excluding the primitives with empty volumes.

We summarize our main contributions as follows:

• We propose a novel primitive-based 3D shape representation with learnable implicit algebraic surfaces, which can produce more complex topologies with few parameters and is hence appropriate for describing geometrically complex shapes.

• We develop various constraints to produce solvable and closed primitives with proper scales in desired locations, to ease learning and generate appealing results.

• We experimentally demonstrate that 3DIAS outperforms state-of-the-art methods. Furthermore, we show that it can semantically learn the components of 3D shapes and adjust the number of used primitives.

2. Related Work

In this section, we review some related DNN-based 3D shape reconstruction methods with various representations.

Explicit representations. Voxel grids are commonly used for discriminative [25, 32] and generative [9, 13] tasks since they are the simplest 3D representation. However, results represented with voxels are limited in resolution due to memory issues. Although [17, 38] proposed to reconstruct 3D objects in a multi-scale fashion, they are still limited to comparably small 256³ voxel grids and require multiple forward passes to generate the final 3D voxels. 3D point clouds give an alternative representation of 3D shapes. Qi et al. [31, 33] pioneered point clouds as a discriminative representation for 3D tasks using deep learning. Fan et al. [11] introduced point clouds to the 3D reconstruction task as an output representation. However, since point clouds carry no information about connections between the points, they need additional post-processing procedures [3, 5] to build a 3D mesh as an output. Meshes are another commonly used representation for 3D shapes [14, 20]. However, most of

[Figure 3 diagram: a single RGB image is encoded by ResNet-18; three decoders produce the learnable symmetric matrices B (4 FCs), the centers c (4 FCs, tanh), and the scales R (2 FCs, sigmoid), which are combined under the constraints (Eqs. 2, 4, 5, 6, 11) into the 3D implicit algebraic surface (3DIAS).]

Figure 3: Overview of single RGB image 3D surface reconstruction using 3DIAS. We use an encoder (ResNet-18 [18]) to learn the local and global information from the given single RGB image. Then three sets of fully connected layers decode the latent features to provide the coefficients, the scales, and the locations of the centers for the M primitives. Note that we apply the min operator to take the union over the M primitives.

methods in 3D shape reconstruction using meshes generate meshes with simple topology [41] or utilize a template as a reference. They can only manage objects from the same class [20, 24] and cannot guarantee to produce closed surfaces [14].

Implicit Representations. Implicit representation is a good alternative that avoids the problems above. In contrast to the mesh-based approaches, implicit-based methods do not require a template from the same object class as input. There are mainly two approaches: isosurface-based and primitive-based methods. Chen et al. [8] propose a neural network that takes 3D points and the latent code of the shape, then outputs a value for each point indicating whether the point is outside or inside the shape, as an occupancy function. Park et al. [29] utilize the signed distance function to obtain the zero-set surface of the shape. Tatarchenko et al. [39] proposed an occupancy function that implicitly describes 3D surfaces as the continuous decision boundary of a DNN-based classifier. Compared to voxel representations, this method can estimate an occupancy function that is continuous in 3D with a learnable neural network and can generate at any arbitrary resolution. This approach significantly decreases memory usage during training. The surface can be extracted as a mesh representation from the learned model at test time by a multi-resolution isosurface extraction method. The finite resolution of the voxel or octree cells limits the accuracy of the shapes reconstructed by these methods and their ability to capture fine details of 3D shapes. Deng et al. [10] represented a shape with a convex combination of half-planes. These methods use a single global latent vector to represent the entire surface of a 3D shape. The latent vector is decoded into continuous surfaces with the corresponding implicit networks. While this technique successfully models geometry, it often requires many primitives to obtain a desirable appearance, and it is unclear how many primitives are required.

For the 3D reconstruction task, we compare our implicit-based approach against several state-of-the-art implicit-based methods such as Structured Implicit Function (SIF) [12], OccNet [26], and CvxNet [10]. Moreover, we select Pixel2Mesh [41] and AtlasNet [15], which use explicit surface generation in contrast to the previous methods.

3. 3DIAS

In this section, we first introduce our 3D shape representation based on implicit algebraic surfaces. Then we explain the additional constraints for effective learning. Next, we describe the proposed network and learning procedure to reconstruct the surface of 3D shapes with our representation.

3.1. 3D shape representation

3.1.1 Implicit algebraic primitives

We build a complex target 3D shape with a combination of primitives p(x, y, z), the basic geometric building blocks of 3D shapes, as shown in Figure 1. To select a primitive with a large degree of freedom (i.e., complex geometry and topology) and few parameters, we employ the implicit algebraic surface, the zero-level set of a multivariate polynomial function of x, y, and z, as Eq. 1:

$$p(x, y, z) = \sum_{0 \le i+j+k \le d} a_{ijk}\, x^i y^j z^k = \mathbf{v} A \mathbf{v}^{T} = 0, \qquad (1)$$

where $\mathbf{v} = [1, x, y, z, x^2, y^2, z^2, \dots]$, and $d$, $a_{ijk}$, and $A$ are the degree, the coefficients, and the coefficient matrix of the polynomial function, respectively. Like many other implicit surfaces, the implicit algebraic surface divides the space and maps points in 3D space to negative and positive values. Therefore, to represent detailed surfaces S(x, y, z) of 3D shapes, we can combine these primitives by utilizing constructive solid geometry [19] and apply boolean operations to them, which can be formulated as Eq. 2:

$$S(x, y, z) = \bigcup_{m=1}^{M} p_m(x, y, z) = \min\big(p_1(x, y, z), \dots, p_M(x, y, z)\big), \qquad (2)$$

where M is the number of primitives in the union.
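Eqs. 1 and 2 can be sketched numerically as follows. This is an illustrative NumPy sketch, not the paper's implementation; the monomial ordering of v and the helper names (`monomials`, `primitive`, `union`) are our own assumptions.

```python
import numpy as np

def monomials(x, y, z):
    """Degree-<=2 monomial vector v = [1, x, y, z, x^2, y^2, z^2, xy, yz, zx]
    (one possible ordering), so that v A v^T spans all terms up to degree 4."""
    return np.array([1.0, x, y, z, x * x, y * y, z * z, x * y, y * z, z * x])

def primitive(A, x, y, z):
    """Evaluate one implicit algebraic primitive p(x, y, z) = v A v^T (Eq. 1)."""
    v = monomials(x, y, z)
    return v @ A @ v

def union(As, x, y, z):
    """Union of primitives via the min operator (Eq. 2)."""
    return min(primitive(A, x, y, z) for A in As)

# A sphere of radius 0.5 written as a (degenerate) quartic: x^2 + y^2 + z^2 - 0.25.
A_sphere = np.zeros((10, 10))
A_sphere[0, 0] = -0.25
A_sphere[1, 1] = A_sphere[2, 2] = A_sphere[3, 3] = 1.0

assert union([A_sphere], 0, 0, 0) < 0   # negative inside the shape
assert union([A_sphere], 1, 0, 0) > 0   # positive outside the shape
```

The min over primitives gives a negative value exactly where at least one primitive is negative, which is what makes the union of the enclosed volumes.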

3.1.2 Constraints on primitives

We apply a set of constraints on the defined implicit algebraic primitive p(x, y, z) to better approximate the surface S(x, y, z) of a target 3D shape.

Solvability of primitives. Easy visualization and rendering for a 3D shape representation can be useful in many computer graphics and virtual reality applications. The class of implicit algebraic primitives with algebraic solutions is well suited to ray-tracing and hence achieves these properties. Accordingly, we use multivariate quartic (d = 4) polynomial functions as the primitives p(x, y, z), because they have the highest degree of freedom among all implicit algebraic primitives with closed-form algebraic solutions [34].

Closedness and scales of primitives. Beyond the aforementioned constraint, we need to guarantee that the reconstructed shape, and hence all primitives, have closed surfaces as in Figure 2. We can ensure that a quartic primitive has a closed surface by enforcing its fourth-degree terms to always be positive [22] as Eq. 3:

$$p_4(x, y, z) = \sum_{i+j+k=4} a_{ijk}\, x^i y^j z^k = \mathbf{u} A_{[5:10]} \mathbf{u}^{T} > 0, \qquad (3)$$

where $\mathbf{u} = [x^2, y^2, z^2, xy, yz, zx]$ and $A_{[5:10]}$ is the $6 \times 6$ sub-matrix of A containing the coefficients of the fourth-degree terms. This implies that $A_{[5:10]} \succ 0$ is a positive definite (PD) matrix. Note that, with the PD matrix $A_{[5:10]} \succ 0$, the algebraic surface exists if and only if $p_{3\downarrow}(x, y, z) = \sum_{0 \le i+j+k \le 3} a_{ijk}\, x^i y^j z^k$ is negative and $|p_{3\downarrow}(x, y, z)| > |p_4(x, y, z)|$ for some points in $\mathbb{R}^3$. Otherwise, the primitive has zero volume because it has no real-valued solution.
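The positive-definiteness test of Eq. 3 reduces to checking the eigenvalues of the 6×6 sub-matrix. A minimal sketch (the basis ordering of u and the function name `is_closed` are our assumptions, not from the paper):

```python
import numpy as np

def is_closed(A4, tol=0.0):
    """A quartic primitive has a closed (bounded) zero-set when the 6x6
    coefficient matrix of its fourth-degree terms, in the basis
    u = [x^2, y^2, z^2, xy, yz, zx], is positive definite (Eq. 3)."""
    eigvals = np.linalg.eigvalsh(A4)      # A4 symmetric -> real eigenvalues
    return bool(eigvals.min() > tol)

# Identity gives p4 = x^4 + y^4 + z^4 + (xy)^2 + (yz)^2 + (zx)^2 > 0: closed.
assert is_closed(np.eye(6))
# One negative eigenvalue lets the quartic form go negative along some
# direction, so the surface can escape to infinity: not closed.
assert not is_closed(np.diag([1.0, 1.0, -1.0, 1.0, 1.0, 1.0]))
```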

Moreover, since each primitive reconstructs a different segment of a target 3D shape, we need to ensure that its volume is smaller than the target shape. To prevent generating large primitives and to control their scales, we develop an upper bound for each primitive. To reconstruct a primitive p(x, y, z) included in the upper bound q(x, y, z), it is sufficient to satisfy the inequality q(x, y, z) < p(x, y, z) in $\mathbb{R}^3$, which is equivalent to Eq. 4:

$$p(x, y, z) = h(x, y, z) + q(x, y, z), \qquad (4)$$

where h(x, y, z) is a positive-valued function. The function h(x, y, z) is always positive if and only if its matrix of coefficients H is positive definite. As a result, the coefficient matrix A of the primitive p(x, y, z) is the sum of a positive definite matrix $H \in \mathbb{R}^{10 \times 10}$ and the coefficient matrix $Q \in \mathbb{R}^{10 \times 10}$ of the upper bound q(x, y, z). For simplicity, we consider the upper bound q(x, y, z) as Eq. 5:

$$q(x, y, z) = x^4 + y^4 + z^4 - R, \qquad (5)$$

where R is a positive value that controls the size of the upper bound. Note that the developed upper bound is a more general constraint that also satisfies the criteria of the closedness constraint. For the proof, refer to the supplementary materials.

Locations of primitives. We also encourage the primitives to better cover different components of the 3D shape by applying a constraint on their locations. Accordingly, we restrict the locations of their centers $c = (c_1, c_2, c_3)$ to the areas near the shape and reformulate the primitives as Eq. 6:

$$p(x, y, z) = \sum_{0 \le i+j+k \le 4} a_{ijk}\, (x - c_1)^i (y - c_2)^j (z - c_3)^k = 0. \qquad (6)$$

Therefore, within these constraints, we can reconstruct primitives with controlled scales and locations, which facilitates the reconstruction and provides more details.

3.2. 3D shape reconstruction

To reconstruct a 3D shape with the proposed representation from an input observation $o \in \mathcal{X}$ (e.g., a single image), we design a DNN architecture that receives the input and outputs the corresponding matrix H, center c, and parameter R for each primitive, as shown in Figure 3.

3.2.1 Training losses

We apply various losses to reconstruct 3D shapes.

Loss sign. Since the target surface divides 3D space into inside and outside, we define a sign function $\mathrm{sign}(x, y, z): \mathbb{R}^3 \to \{0, -1, 1\}$ on sample points $P \subset \mathbb{R}^3$, where the values 0, −1, and 1 correspond to points on the target surface, inside it, and outside it, respectively. Likewise, we can classify the points on/inside/outside the reconstructed implicit surface S(x, y, z) and reduce a loss between their predicted and ground-truth signs. We use the mean square error (MSE) as the loss function as Eq. 7:

$$\mathcal{L}^{sign}_{\mathcal{B}} = \sum_{i \in \{on, in, out\}} \lambda_i \cdot \mathbb{E}_{p_i \sim P} \left\| \tanh(S(p_i)) - \mathrm{sign}(p_i) \right\|^2, \qquad (7)$$

where $\mathcal{B} \subset \mathcal{X}$ is a training batch and $\lambda_i$ are the weights corresponding to each sign. Note that $\mathcal{L}^{sign}_{\mathcal{B}}$ enforces the network to reconstruct the desired surface and simultaneously refuse to generate redundant surfaces. Moreover, the MSE loss forces further attention on distinguishing the inside and outside points near the reconstructed surface, because their $\tanh(S(p_i))$ are near zero while their ground truths are −1 and +1, respectively.
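Eq. 7 can be sketched as below. This is an illustrative NumPy version (the paper trains with PyTorch); the dictionary-based interface and default weights $\lambda_{on}{=}2$, $\lambda_{in}{=}1$, $\lambda_{out}{=}10$ follow Section 3.2.2, while the function shape itself is our assumption.

```python
import numpy as np

def sign_loss(S_vals, signs, lam={"on": 2.0, "in": 1.0, "out": 10.0}):
    """MSE between tanh of the predicted implicit values S(p_i) and the
    ground-truth signs (0 on the surface, -1 inside, +1 outside), weighted
    per group as in Eq. 7. `S_vals` and `signs` map group name -> array."""
    loss = 0.0
    for g in ("on", "in", "out"):
        err = np.tanh(S_vals[g]) - signs[g]
        loss += lam[g] * np.mean(err ** 2)
    return loss

S_vals = {"on": np.array([0.0]), "in": np.array([-3.0]), "out": np.array([3.0])}
signs  = {"on": np.array([0.0]), "in": np.array([-1.0]), "out": np.array([1.0])}
# Points far on the correct side give tanh close to the target sign -> small loss.
assert sign_loss(S_vals, signs) < 0.1
```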

Loss normal. To improve the reconstruction, we use the normal vectors as second-order information. Therefore, we define an MSE loss between the ground-truth normal vectors $n_g$ of the sample points $p_{on}$ on the surface of a target mesh model and their normal vectors $n_r$ obtained by Eq. 8:

$$\mathcal{L}^{n}_{\mathcal{B}} = \mathbb{E}_{p_{on} \sim P} \left\| n_r(p_{on}) - n_g(p_{on}) \right\|^2, \qquad (8)$$

where the normal vectors on the union surface can be directly determined for any point on the surface as Eq. 9:

$$n_r = \frac{\nabla S(x, y, z)}{\|\nabla S(x, y, z)\|} = \frac{\left( \frac{\partial S}{\partial x}, \frac{\partial S}{\partial y}, \frac{\partial S}{\partial z} \right)}{\sqrt{\left( \frac{\partial S}{\partial x} \right)^2 + \left( \frac{\partial S}{\partial y} \right)^2 + \left( \frac{\partial S}{\partial z} \right)^2}}, \quad \text{subject to: } \nabla S(x, y, z) = \nabla p_{m^*}(x, y, z), \;\; m^* = \arg\min_m \, p_m(x, y, z). \qquad (9)$$

Note that $m^*$ is the index of the closest primitive (i.e., the primitive with the smallest value $p_{m^*}(x, y, z)$) to the point.
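Eq. 9 says the normal of the union is the normalized gradient of whichever primitive attains the min. A minimal sketch (the analytic polynomial gradients of the paper are replaced here by finite differences purely for illustration):

```python
import numpy as np

def surface_normal(primitives, point, eps=1e-5):
    """Normal of the union surface at `point` (Eq. 9): take the gradient of
    the closest primitive (the one with the smallest value) and normalize.
    `primitives` is a list of callables p_m(x, y, z); gradients are estimated
    by central finite differences for illustration."""
    m_star = min(range(len(primitives)),
                 key=lambda m: primitives[m](*point))
    p = primitives[m_star]
    grad = np.zeros(3)
    for i in range(3):
        lo, hi = np.array(point, float), np.array(point, float)
        lo[i] -= eps; hi[i] += eps
        grad[i] = (p(*hi) - p(*lo)) / (2 * eps)
    return grad / np.linalg.norm(grad)

# On the unit sphere x^2 + y^2 + z^2 - 1 = 0, the normal at (1, 0, 0) is (1, 0, 0).
sphere = lambda x, y, z: x * x + y * y + z * z - 1.0
n = surface_normal([sphere], (1.0, 0.0, 0.0))
assert np.allclose(n, [1.0, 0.0, 0.0], atol=1e-6)
```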

Finally, the total loss $\mathcal{L}^{total}_{\mathcal{B}}$ is the weighted sum of the defined losses with the corresponding weights as Eq. 10:

$$\mathcal{L}^{total}_{\mathcal{B}} = \mathcal{L}^{sign}_{\mathcal{B}} + \lambda_n \mathcal{L}^{n}_{\mathcal{B}}. \qquad (10)$$

3.2.2 Implementation details

We consider the bounding box $C = [-1, 1]^3$ and fit the given input 3D shapes into it, keeping their aspect ratios. Then we extract 1M points from $[-1.1, 1.1]^3 \subset \mathbb{R}^3$ surrounding the 3D shapes, and at each iteration, we randomly select 1% of them jointly as inside points $p_{in}$ and outside points $p_{out}$ for all shapes in the batch $\mathcal{B}$. Moreover, we pick 10k points $p_{on}$ on the surface of each 3D shape and randomly select 20% of them at each iteration.

To capture the information of the input observation $o \in \mathcal{X}$, we employ the pretrained ResNet-18 [18] as the encoder. Then three sets of independent fully connected (FC) layers, of sizes (4096, 4096, 4096, 55 × M), (1024, 512, 256, 3 × M), and (256, M), decode the encoded features to obtain the parameters of each symmetric matrix B, the center c, and the scalar R for M = 100 primitives in Eq. 2, respectively. All FC layers except the last ones use the ReLU non-linear activation function. We also apply three batch normalization layers after the first three FC layers of the B decoder to accelerate training and boost the performance.

To ensure H is a PD matrix, we parameterize it as Eq. 11:

$$H = B B^{T} + \alpha I \succ 0, \qquad (11)$$

where $\alpha = 0.0001$ is a small scalar factor, and B and I are $10 \times 10$ symmetric and identity matrices, respectively. Moreover, we apply the sigmoid function to the output of the R decoder to generate a value in $(0, 1) \subset \mathbb{R}$ as the parameter R of Eq. 5, to control the size of all primitives and guarantee that they are not larger than the bounding box C. Furthermore, we apply tanh to the output of the c decoder to generate the centers inside the bounding box C. Therefore, each primitive is parameterized with 59 parameters in total, including three parameters for the center $c = (c_1, c_2, c_3)$, one parameter for R, and 55 parameters for the matrix B.
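The parameterization of Eq. 11 can be sketched as follows: a symmetric 10×10 matrix B is built from 55 free parameters (its upper triangle), and $B B^T + \alpha I$ is positive definite for any such B. The helper names are our own; the paper's network emits the 55 values per primitive.

```python
import numpy as np

def symmetric_from_params(theta):
    """Build a 10x10 symmetric matrix B from its 55 free parameters
    (upper triangle including the diagonal)."""
    B = np.zeros((10, 10))
    iu = np.triu_indices(10)           # 55 index pairs
    B[iu] = theta
    B[(iu[1], iu[0])] = theta          # mirror into the lower triangle
    return B

def pd_matrix(theta, alpha=1e-4):
    """H = B B^T + alpha*I is positive definite for any B (Eq. 11)."""
    B = symmetric_from_params(theta)
    return B @ B.T + alpha * np.eye(10)

H = pd_matrix(np.random.randn(55))
assert np.linalg.eigvalsh(H).min() > 0   # PD regardless of the 55 parameters
```

$B B^T$ is always positive semi-definite, so the small $\alpha I$ shift is what guarantees strict positive definiteness without constraining the network output.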

We set the parameters $\lambda_{on}$, $\lambda_{in}$, $\lambda_{out}$, and $\lambda_n$ to 2, 1, 10, and 1, respectively. We train our encoder-decoder architecture with the Adam optimizer with an initial learning rate of 1e-4, weight decay of 1e-7, and a batch size of 64. We implement our model in Python 3.7 using PyTorch with CUDA instructions.

4. Experiments

In this section, we provide information about the evaluation setups and show qualitative and quantitative results of our method compared to state-of-the-art methods on single RGB image 3D shape reconstruction. We also perform various ablation studies to analyze our method better. More experiments are available in the supplementary material.

4.1. Dataset and Metrics

We evaluate our approach on the subset of the ShapeNet dataset [6] with the same image renderings and training/testing split provided by Choy et al. [9]. We also employ mesh-fusion [36] to generate watertight meshes from the 3D CAD models. We then use Houdini [35] to extract the inside/outside/on points and the normal vectors. For evaluation, we use the volumetric IoU, Chamfer [11], and F-score [23] metrics. Volumetric IoU measures the overlapping volume between the ground-truth meshes and the reconstructed surfaces. Chamfer is the mean of the accuracy and the completeness scores. The accuracy metric is defined as the mean distance of points on the reconstructed surface to their nearest neighbors on the ground-truth mesh; the completeness metric is defined in the opposite direction. F-score is the harmonic mean of precision and recall, where precision is the percentage of correctly reconstructed surface points. To compute IoU, we sample 100k points from the bounding box. To evaluate Chamfer and F-score, we first convert the reconstructed surfaces to meshes; then, similar to CvxNet [10], we sample 100k points on the reconstructed and the ground-truth meshes.
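The point-based metrics above can be sketched as below. This is an illustrative brute-force version (the threshold value `tau` and the function name are our assumptions; practical evaluation typically uses a k-d tree for the nearest-neighbor queries):

```python
import numpy as np

def chamfer_and_fscore(pred, gt, tau=0.01):
    """Symmetric Chamfer distance and F-score between two point sets
    (N x 3 arrays), using brute-force nearest neighbors for illustration.
    `tau` is the distance threshold used for precision/recall."""
    d = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)
    acc = d.min(axis=1)                 # pred -> gt distances (accuracy)
    comp = d.min(axis=0)                # gt -> pred distances (completeness)
    chamfer = (acc.mean() + comp.mean()) / 2
    precision = (acc < tau).mean()
    recall = (comp < tau).mean()
    f = 2 * precision * recall / (precision + recall + 1e-12)
    return chamfer, f

pts = np.random.rand(100, 3)
chamfer, f = chamfer_and_fscore(pts, pts)   # identical sets: perfect scores
assert chamfer < 1e-9 and abs(f - 1.0) < 1e-6
```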

Figure 4: Qualitative comparison on single RGB image 3D shape reconstruction. SIF [12], AtlasNet [15], OccNet [26], CvxNet [10], and our 3DIAS output reconstructed 3D shapes from the given RGB image. (a) Comparison with other methods for the samples shown in CvxNet [10]. (b) More qualitative comparisons with CvxNet [10].

Category | IoU (P2M, AtlasNet, OccNet, SIF, CvxNet, 3DIAS) | Chamfer (P2M, AtlasNet, OccNet, SIF, CvxNet, 3DIAS) | F-Score (AtlasNet, OccNet, SIF, CvxNet, 3DIAS)
airplane | 0.420, -, 0.571, 0.530, 0.598, 0.549 | 0.187, 0.104, 0.147, 0.167, 0.093, 0.087 | 67.24, 62.87, 52.81, 68.16, 59.48
bench | 0.323, -, 0.485, 0.333, 0.461, 0.485 | 0.201, 0.138, 0.155, 0.261, 0.133, 0.106 | 54.50, 56.91, 37.31, 54.64, 60.17
cabinet | 0.664, -, 0.733, 0.648, 0.709, 0.730 | 0.196, 0.175, 0.167, 0.233, 0.160, 0.123 | 46.43, 61.79, 31.68, 46.09, 61.81
car | 0.552, -, 0.737, 0.657, 0.675, 0.737 | 0.180, 0.141, 0.159, 0.161, 0.103, 0.091 | 51.51, 56.91, 37.66, 47.33, 58.07
chair | 0.396, -, 0.501, 0.389, 0.491, 0.509 | 0.265, 0.209, 0.228, 0.380, 0.337, 0.186 | 38.89, 42.41, 26.90, 38.49, 43.14
display | 0.490, -, 0.471, 0.491, 0.576, 0.538 | 0.239, 0.198, 0.278, 0.401, 0.223, 0.211 | 42.79, 38.96, 27.22, 40.69, 42.40
lamp | 0.323, -, 0.371, 0.260, 0.311, 0.381 | 0.308, 0.305, 0.479, 1.096, 0.795, 0.607 | 33.04, 38.35, 20.59, 31.41, 37.52
speaker | 0.599, -, 0.647, 0.577, 0.620, 0.638 | 0.285, 0.245, 0.300, 0.554, 0.462, 0.351 | 35.75, 42.48, 22.42, 29.45, 39.16
rifle | 0.402, -, 0.474, 0.463, 0.515, 0.423 | 0.164, 0.115, 0.141, 0.193, 0.106, 0.116 | 64.22, 56.52, 53.20, 63.74, 47.44
sofa | 0.613, -, 0.680, 0.606, 0.677, 0.685 | 0.212, 0.177, 0.194, 0.272, 0.164, 0.158 | 43.46, 48.62, 30.94, 42.11, 49.73
table | 0.395, -, 0.506, 0.372, 0.473, 0.509 | 0.218, 0.190, 0.189, 0.454, 0.358, 0.245 | 44.93, 58.49, 30.78, 48.10, 57.63
phone | 0.661, -, 0.720, 0.658, 0.719, 0.751 | 0.149, 0.128, 0.140, 0.159, 0.083, 0.080 | 58.85, 66.09, 45.61, 59.64, 71.35
vessel | 0.397, -, 0.530, 0.502, 0.552, 0.538 | 0.212, 0.151, 0.218, 0.208, 0.173, 0.206 | 49.87, 42.37, 36.04, 45.88, 40.70
mean | 0.480, -, 0.571, 0.499, 0.567, 0.575 | 0.216, 0.175, 0.215, 0.349, 0.245, 0.197 | 48.57, 51.75, 34.86, 47.36, 52.22

Table 1: Evaluation of single image 3D shape reconstruction. We evaluate and compare our method (3DIAS) to the state-of-the-art methods, including P2M [41], AtlasNet [15], OccNet [26], SIF [12], and CvxNet [10], on a part of the ShapeNet dataset [6] in terms of IoU, Chamfer, and F-score.

4.2. Reconstruction

We experimentally evaluate our method 3DIAS trained on multiple classes, compare it with state-of-the-art methods on single RGB image 3D shape reconstruction, and summarize the results in Table 1. The experiments demonstrate the superiority of 3DIAS over the explicit-based methods P2M [41] and AtlasNet [15], the isosurface-based method OccNet [26], and the recent primitive-based methods SIF [12] and CvxNet [10] in terms of volumetric IoU and F-score. We also achieve the second-best performance on the Chamfer metric. We show more quantitative results of 3DIAS for the network trained on a single class in the supplementary material.

Moreover, we qualitatively evaluate 3DIAS trained on a single class and compare it with the previous methods in Figure 4. The results illustrate that 3DIAS achieves smooth surfaces with desirable geometrical details. Unlike the previous methods, 3DIAS succeeds in reconstructing 3D shapes with more complex topologies (e.g., chair), as shown in Figure 4a. Moreover, compared to CvxNet [10], 3DIAS can better reconstruct thin shapes (e.g., lamp) and shapes whose similar examples are rare in the training dataset (e.g., airplane and car); see Figure 4b.

4.3. Ablation study

We perform several ablation studies to analyze our proposed representation and reconstruction procedure. First, we show the ability of our method to generate more complex primitives compared to other primitive-based methods. Then we compare the number of parameters required to represent 3D shapes with our representation and with other methods. Finally, we demonstrate the power of our reconstruction

Representation | SIF | OccNet | CvxNet | 3DIAS
Num of params | 700 | 11M | 7700 | 480

Table 2: Number of parameters. The average number of parameters for representing 3D shapes by different methods.

scheme to learn semantic structures in an unsupervised manner. Moreover, we evaluate the effects of the designed constraints and the defined loss functions in reconstructing 3D shapes with highly detailed appearances.

Figure 5: Statistics of the number of primitives. We compute the average number of primitives selected among all M = 100 primitives by the network for each category.

4.3.1 Complexity of primitives

We illustrate that our proposed constrained primitive is able to form geometrically (e.g., curved) and topologically (e.g., genus-one) complex shapes, as shown in Figure 6, while previous primitive-based methods such as cubes [40], ellipsoids [12], superquadrics [30], and convexes [10] cannot form such complex shapes.

4.3.2 The number of parameters

In Section 3.1 we argued that a primitive may have no real solution when $|p_{3\downarrow}(x, y, z)| < |p_4(x, y, z)|$ or $p_{3\downarrow}(x, y, z)$ is non-negative for all points $(x, y, z) \in \mathbb{R}^3$ (i.e., no valid surface). Accordingly, our method can ignore some of the M = 100 primitives by assigning positive definite coefficient matrices A to them and maintain a sufficient number of primitives. Therefore, these unsolvable primitives do not participate in reconstructing the surface S. Note that our method selects sets of primitives that are mainly different for inter-category shapes and have a large overlap for intra-category shapes. During the test phase, we can efficiently check the eigenvalues

Figure 6: The complexity of our primitives: (a) lamp, (b) speaker, (c) chair. The first and second rows show the reconstructed shapes and their corresponding primitives, respectively. The proposed primitive can effectively represent curved and torus shapes.

of the coefficient matrix of each primitive and eliminate the primitives with non-negative eigenvalues. Our experiments demonstrate that our network selects few primitives to reconstruct 3D shapes, as shown in Figure 5. In addition, since each quartic primitive can finally be identified with only 35 coefficients $a_{ijk}$, the surface S of 3D shapes with the 3DIAS representation can be represented with only $35 \times 13.71 \approx 480$ parameters on average. To represent 3D shapes, 3DIAS requires on average 68.571%, 0.004%, and 6.234% of the parameters used in SIF [12], OccNet [26], and CvxNet [10], respectively; see Table 2.

4.3.3 Unsupervised semantics segmentation

We also illustrate that our network learns a semantic structure without any part-level supervision, such that one primitive usually covers the same part of reconstructed 3D shapes within the same class under the 3DIAS representation. We evaluate the semantic structures on the PartNet [27] dataset, which provides labels for hierarchical parts of ShapeNet. The quantitative experiments in Figure 7 show that our method achieves better and comparable average accuracy compared to CvxNet [10] and BAE [7], respectively. In addition, 3DIAS achieves better accuracy than both methods for thin parts (e.g., arm). Moreover, our qualitative experiments illustrate that one primitive tends to cover the same semantic part, as shown in Figure 8. This tendency is more pronounced for the dominant primitive that covers more points. For instance, the dominant primitives mainly cover the seat of chairs because most of the chairs have seat parts. Please see the supplementary material for more examples.

Part | BAE | CvxNet | 3DIAS
Back | 86.36 | 91.50 | 88.87
Seat | 73.66 | 90.63 | 70.29
Base | 88.46 | 71.95 | 78.51
Arm | 65.75 | 38.94 | 74.86
mean | 78.56 | 73.25 | 78.13

Figure 7: Evaluation of semantic segmentation. (left) The distribution of PartNet labels within 4 primitives in the chair class. (right) The classification accuracy for each part. We follow the evaluation method introduced in CvxNet [10].

Constraints | IoU | Chamfer | F-Score
-center | 0.549 | 0.387 | 46.72
-scale | 0.559 | 0.261 | 48.45
-scale, -closedness | 0.546 | 0.280 | 44.44
All | 0.575 | 0.197 | 52.22

Table 3: Ablation study on constraints. We compare the effects of the center, scale, and closedness constraints in terms of IoU, Chamfer, and F-score. Note that in each configuration we ignore one or two constraints.

4.3.4 Effects of constraints

We study the effect of our designed constraints on reconstructing 3D shapes. In each experiment, we evaluate our baseline by ignoring one or more constraints. The quantitative results based on volumetric IoU, Chamfer, and F-score show the importance of each constraint; see Table 3. We believe these constraints, especially the center constraint, encourage the network to reconstruct a detailed 3D shape.

4.3.5 Effects of losses

While Lsign for the inside/outside points tries to distinguish

inside and outside of 3D shapes, it is not enough to achieve

a detailed surface due to the lack of sample points near the

surface. Therefore, points on the surface and their normal

vectors can facilitate the reconstruction. Note that normal

vectors carry important information on 3D geometry, such

as the local orientation of surfaces. Accordingly, we use

Lsign loss and Lnloss for the points on the surface to better

approximate the surfaces. We evaluate the effects of each

Lsign,Ln, and their combination by excluding them for the

points on the surface and summarize the results in Table 4.

(a) Airplane (b) Chair

Figure 8: Qualitative results on unsupervised semantic segmentation. We visualize the results of 3DIAS for some samples in the categories of (a) airplane and (b) chair.

Losses         IoU     Chamfer   F-Score
-Ln            0.568   0.210     49.17
-Lsign         0.548   0.219     43.77
-Lsign, -Ln    0.542   0.232     42.37
All            0.575   0.197     52.22

Table 4: Ablation study on losses. We compare the effects of the Lsign and Ln losses for the points on the surface in terms of IoU, Chamfer, and F-score. Note that in each configuration we ignore one or two loss functions. Moreover, we do not exclude the Lsign for the inside/outside points. Please see the supplementary material for more examples.

Note that for all the experiments in Table 4 we do not exclude Lsign for the inside/outside points. The results indicate the importance of points on the surface to achieve more detailed 3D shapes.
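As a rough sketch of how the two kinds of losses discussed above could be realized (the names sign_loss and normal_loss and the exact hinge form are illustrative assumptions, not the paper's definitions):

```python
import numpy as np

def sign_loss(f_vals: np.ndarray, labels: np.ndarray) -> float:
    """Hinge-style sign loss: the implicit function f should be negative
    inside (label -1) and positive outside (label +1), so we penalize
    max(0, -label * f)."""
    return float(np.maximum(0.0, -labels * f_vals).mean())

def normal_loss(grad_f: np.ndarray, normals_gt: np.ndarray) -> float:
    """Penalize misalignment between the normalized gradient of the
    implicit function at surface points and the ground-truth normals."""
    g = grad_f / (np.linalg.norm(grad_f, axis=1, keepdims=True) + 1e-8)
    cos = (g * normals_gt).sum(axis=1)
    return float((1.0 - cos).mean())

# Toy example: a unit sphere f(p) = |p|^2 - 1, whose gradient is 2p.
f = lambda p: (p ** 2).sum(axis=1) - 1.0
pts_in = np.array([[0.1, 0.0, 0.0]])   # inside the sphere
pts_out = np.array([[2.0, 0.0, 0.0]])  # outside the sphere
labels = np.array([-1.0, 1.0])
print(sign_loss(np.concatenate([f(pts_in), f(pts_out)]), labels))  # 0.0

surf = np.array([[1.0, 0.0, 0.0]])  # a surface point; outward normal = surf
print(normal_loss(2.0 * surf, surf))  # ~0.0, gradient aligns with normal
```

In training, f and its gradient would be differentiable functions of the network's predicted coefficients, and the two terms would be summed with weights.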

5. Conclusion

In this paper, we propose a primitive-based representa-

tion and a learning scheme in which the primitives are learn-

able implicit algebraic surfaces that can jointly approximate

3D shapes. We design various constraints and loss functions

to achieve high-quality and detailed 3D shapes. We exper-

imentally demonstrate that our method outperforms state-

of-the-art methods in most of the metrics. Moreover, we il-

lustrate that our method can learn semantic meanings with-

out part-level supervision by automatically selecting sets of

primitives parametrized by only a few parameters. In the future, we will utilize the solvability of the designed primitives to develop a soft renderer, which enables reconstructing 3D shapes with self-supervised learning.

Acknowledgement

This work was supported in part by an IITP grant funded

by the Korean government [No. 2021-0-01343, Artiﬁcial

Intelligence Graduate School Program (Seoul National Uni-

versity)].

References

[1] Panos Achlioptas, Olga Diamanti, Ioannis Mitliagkas, and

Leonidas J. Guibas. Representation learning and adversarial

generation of 3d point clouds. CoRR, 2017.

[2] C. Bajaj. The emergence of algebraic curves and surfaces in

geometric design. 1992.

[3] Fausto Bernardini, J. Mittleman, Holly Rushmeier, Cláudio Silva, and Gabriel Taubin. The ball-pivoting algorithm for surface reconstruction. IEEE Transactions on Visualization and Computer Graphics, 5:349–359, 1999.

[4] Andrew Brock, Theodore Lim, James M Ritchie, and

Nick Weston. Generative and discriminative voxel mod-

eling with convolutional neural networks. arXiv preprint

arXiv:1608.04236, 2016.

[5] F. Calakli and Gabriel Taubin. Ssd: Smooth signed distance

surface reconstruction. Computer Graphics Forum, 30:1993

– 2002, 11 2011.

[6] Angel X. Chang, Thomas A. Funkhouser, Leonidas J.

Guibas, Pat Hanrahan, Qi-Xing Huang, Zimo Li, Silvio

Savarese, Manolis Savva, Shuran Song, Hao Su, Jianxiong

Xiao, Li Yi, and Fisher Yu. Shapenet: An information-rich

3d model repository. CoRR, 2015.

[7] Zhiqin Chen, Kangxue Yin, Matthew Fisher, Siddhartha

Chaudhuri, and Hao Zhang. Bae-net: Branched autoen-

coder for shape co-segmentation. Proceedings of Interna-

tional Conference on Computer Vision (ICCV), 2019.

[8] Zhiqin Chen and Hao Zhang. Learning implicit ﬁelds for

generative shape modeling. In Proceedings of the IEEE Con-

ference on Computer Vision and Pattern Recognition, pages

5939–5948, 2019.

[9] Christopher B. Choy, Danfei Xu, JunYoung Gwak, Kevin

Chen, and Silvio Savarese. 3d-r2n2: A uniﬁed approach for

single and multi-view 3d object reconstruction. CoRR, 2016.

[10] Boyang Deng, Kyle Genova, Soroosh Yazdani, Soﬁen

Bouaziz, Geoffrey Hinton, and Andrea Tagliasacchi. Cvxnet:

Learnable convex decomposition. In Proceedings of the

IEEE/CVF Conference on Computer Vision and Pattern

Recognition, pages 31–44, 2020.

[11] Haoqiang Fan, Hao Su, and Leonidas J. Guibas. A point

set generation network for 3d object reconstruction from a

single image. CoRR, 2016.

[12] Kyle Genova, Forrester Cole, Daniel Vlasic, Aaron Sarna,

William T. Freeman, and Thomas A. Funkhouser. Learn-

ing shape templates with structured implicit functions. 2019

IEEE/CVF International Conference on Computer Vision

(ICCV), 2019.

[13] Rohit Girdhar, David F. Fouhey, Mikel Rodriguez, and Ab-

hinav Gupta. Learning a predictable and generative vector

representation for objects. CoRR, 2016.

[14] Thibault Groueix, Matthew Fisher, Vladimir G. Kim, Bryan Russell, and Mathieu Aubry. AtlasNet: A Papier-Mâché Approach to Learning 3D Surface Generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[15] Thibault Groueix, Matthew Fisher, Vladimir G. Kim, Bryan C. Russell, and Mathieu Aubry. AtlasNet: A papier-mâché approach to learning 3d surface generation. CoRR, 2018.

[16] X. Han, Hamid Laga, and M. Bennamoun. Image-based 3d

object reconstruction: State-of-the-art and trends in the deep

learning era. IEEE transactions on pattern analysis and ma-

chine intelligence, 2019.

[17] Christian Häne, Shubham Tulsiani, and Jitendra Malik. Hierarchical surface prediction for 3d object reconstruction. CoRR, 2017.

[18] Kaiming He, X. Zhang, Shaoqing Ren, and Jian Sun. Deep

residual learning for image recognition. 2016 IEEE Confer-

ence on Computer Vision and Pattern Recognition (CVPR),

2016.

[19] John F. Hughes, Andries van Dam, Morgan McGuire,

David F. Sklar, James D. Foley, Steven K. Feiner, and Kurt

Akeley. Computer Graphics - Principles and Practice, 3rd

Edition. Addison-Wesley, 2014.

[20] Angjoo Kanazawa, Michael J. Black, David W. Jacobs, and

Jitendra Malik. End-to-end recovery of human shape and

pose. CoRR, 2017.

[21] Angjoo Kanazawa, Shubham Tulsiani, Alexei A. Efros, and

Jitendra Malik. Learning category-speciﬁc mesh reconstruc-

tion from image collections. CoRR, 2018.

[22] D. Keren, D. Cooper, and J. Subrahmonia. Describing com-

plicated objects by implicit polynomials. IEEE Transactions

on Pattern Analysis and Machine Intelligence, 16(1):38–53,

1994.

[23] Arno Knapitsch, Jaesik Park, Qian-Yi Zhou, and Vladlen

Koltun. Tanks and temples: Benchmarking large-scale scene

reconstruction. ACM Trans. Graph., 36(4), July 2017.

[24] C. Kong, C. Lin, and S. Lucey. Using locally corresponding

cad models for dense 3d reconstructions from a single image.

In 2017 IEEE Conference on Computer Vision and Pattern

Recognition (CVPR), pages 5603–5611, 2017.

[25] D. Maturana and S. Scherer. Voxnet: A 3d convolutional

neural network for real-time object recognition. In 2015

IEEE/RSJ International Conference on Intelligent Robots

and Systems (IROS), pages 922–928, 2015.

[26] Lars M. Mescheder, Michael Oechsle, Michael Niemeyer,

Sebastian Nowozin, and Andreas Geiger. Occupancy net-

works: Learning 3d reconstruction in function space. CoRR,

2018.

[27] Kaichun Mo, Shilin Zhu, Angel X. Chang, L. Yi, Subarna

Tripathi, L. Guibas, and H. Su. Partnet: A large-scale bench-

mark for ﬁne-grained and hierarchical part-level 3d object

understanding. 2019 IEEE/CVF Conference on Computer

Vision and Pattern Recognition (CVPR), 2019.

[28] Federico Monti, Davide Boscaini, Jonathan Masci, Emanuele Rodolà, Jan Svoboda, and Michael M. Bronstein. Geometric deep learning on graphs and manifolds using mixture model cnns. CoRR, 2016.

[29] Jeong Joon Park, Peter Florence, Julian Straub, Richard

Newcombe, and Steven Lovegrove. Deepsdf: Learning con-

tinuous signed distance functions for shape representation.

In Proceedings of the IEEE Conference on Computer Vision

and Pattern Recognition, 2019.

[30] Despoina Paschalidou, Ali O. Ulusoy, and Andreas Geiger.

Superquadrics revisited: Learning 3d shape parsing beyond

cuboids. 2019 IEEE/CVF Conference on Computer Vision

and Pattern Recognition (CVPR), pages 10336–10345, 2019.

[31] Charles Ruizhongtai Qi, Hao Su, Kaichun Mo, and

Leonidas J. Guibas. Pointnet: Deep learning on point sets

for 3d classiﬁcation and segmentation. CoRR, 2016.

[32] Charles Ruizhongtai Qi, Hao Su, Matthias Nießner, Angela

Dai, Mengyuan Yan, and Leonidas J. Guibas. Volumetric and

multi-view cnns for object classiﬁcation on 3d data. CoRR,

2016.

[33] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J.

Guibas. Pointnet++: Deep hierarchical feature learning on

point sets in a metric space. CoRR, 2017.

[34] Michael I. Rosen. Niels Hendrik Abel and equations of the fifth degree. The American Mathematical Monthly, 102(6):495–505, 1995.

[35] SideFX. Houdini, 2020.

[36] David Stutz and Andreas Geiger. Learning 3d shape comple-

tion under weak supervision. CoRR, 2018.

[37] Maxim Tatarchenko, Alexey Dosovitskiy, and Thomas Brox.

Octree generating networks: Efﬁcient convolutional archi-

tectures for high-resolution 3d outputs. In The IEEE Inter-

national Conference on Computer Vision (ICCV), 2017.

[38] Maxim Tatarchenko, Alexey Dosovitskiy, and Thomas Brox.

Octree generating networks: Efﬁcient convolutional archi-

tectures for high-resolution 3d outputs. pages 2107–2115,

10 2017.

[39] Maxim Tatarchenko, Stephan R. Richter, Rene Ranftl,

Zhuwen Li, Vladlen Koltun, and Thomas Brox. What do

single-view 3d reconstruction networks learn? In The IEEE

Conference on Computer Vision and Pattern Recognition

(CVPR), June 2019.

[40] Shubham Tulsiani, H. Su, L. Guibas, Alexei A. Efros, and

Jitendra Malik. Learning shape abstractions by assembling

volumetric primitives. 2017 IEEE Conference on Computer

Vision and Pattern Recognition (CVPR), 2017.

[41] Nanyang Wang, Yinda Zhang, Zhuwen Li, Yanwei Fu, Wei

Liu, and Yu-Gang Jiang. Pixel2mesh: Generating 3d mesh

models from single RGB images. CoRR, 2018.

Supplementary Material for

3DIAS: 3D Shape Reconstruction with Implicit Algebraic Surfaces


Category    IoU             Chamfer         F-Score         #primitives
            Multi   Single  Multi   Single  Multi   Single  Multi   Single
airplane    0.549   0.621   0.087   0.580   59.48   69.51   6.915   23.42
bench       0.485   0.462   0.106   0.677   60.17   59.60   16.97   22.24
cabinet     0.730   0.726   0.123   0.141   61.81   53.24   20.91   22.98
car         0.737   0.747   0.091   0.082   58.07   61.08   10.02   37.96
chair       0.509   0.493   0.186   0.304   43.14   38.85   17.17   19.15
display     0.538   0.511   0.211   1.137   42.40   37.85   13.88   14.95
lamp        0.381   0.352   0.607   1.494   37.52   35.96   9.946   17.54
speaker     0.638   0.632   0.351   0.310   39.16   34.19   19.24   19.65
rifle       0.423   0.509   0.116   0.760   47.44   61.60   4.719   17.16
sofa        0.685   0.667   0.158   0.186   49.73   45.20   21.35   24.55
table       0.509   0.478   0.245   0.369   57.63   50.96   15.97   16.68
phone       0.751   0.734   0.080   0.168   71.35   65.85   14.11   18.23
vessel      0.538   0.550   0.206   0.200   40.70   43.89   7.067   19.41
mean        0.575   0.575   0.197   0.493   52.22   50.60   13.71   21.07

Table S1: Comparison of multi-class and single-class single RGB image 3D shape reconstruction in terms of IoU, Chamfer, F-score, and the number of primitives.

S. 3DIAS analysis

In this section, we provide more experimental analyses of our proposed 3DIAS method. First, we prove that our scale constraint satisfies the criteria of the closedness constraint. Then we quantitatively and qualitatively compare 3DIAS trained on single-class and multi-class data. Finally, we show the effect of viewpoint on reconstructing 3D shapes with our proposed method.

S.1. Holding closedness constraint

In Section 3.1.2, we claim that the parametrized coefficient matrix A satisfies the criteria for the closedness constraint. To ensure the closedness constraint is satisfied, the coefficient matrix A[5:10] of the fourth-degree terms p4(x, y, z) in Eq. 3 must be positive definite. The matrix A[5:10], as the principal sub-matrix of the coefficient matrix A of p(x, y, z), is the summation of the corresponding sub-matrices H[5:10] and Q[5:10]. Since the matrix H is assumed positive definite, its principal sub-matrices (e.g., H[5:10]) are also positive definite. Moreover, the corresponding sub-matrix Q[5:10] of the coefficient matrix Q is a diagonal matrix with the values [1, 1, 1, 0, 0, 0]; hence it is positive semi-definite. Accordingly, the sub-matrix A[5:10], as the summation of a positive definite matrix and a positive semi-definite matrix, is also positive definite, which satisfies the closedness constraint.

*Equal contribution
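The linear-algebra fact used above (the sum of a positive definite matrix and a positive semi-definite matrix is positive definite) can be sanity-checked numerically; the 6x6 matrices below are generic stand-ins for H[5:10] and Q[5:10], not learned parameters:

```python
import numpy as np

def is_positive_definite(m: np.ndarray) -> bool:
    """A symmetric matrix is positive definite iff all its eigenvalues
    are strictly positive."""
    return bool(np.all(np.linalg.eigvalsh(m) > 0))

rng = np.random.default_rng(0)
# Stand-in for H[5:10]: B @ B.T is positive semi-definite, and adding a
# small multiple of the identity makes it positive definite.
b = rng.standard_normal((6, 6))
h_sub = b @ b.T + 1e-3 * np.eye(6)
# Stand-in for Q[5:10]: diagonal with [1, 1, 1, 0, 0, 0], hence positive
# semi-definite (one zero eigenvalue per zero diagonal entry).
q_sub = np.diag([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])

a_sub = h_sub + q_sub
print(is_positive_definite(h_sub), is_positive_definite(a_sub))  # True True
```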

S.2. Multi-class vs single-class training

We also evaluate our method trained individually on each class and compare the results in terms of IoU, Chamfer, and F-score with the multi-class case, summarizing the results in Table S1. Surprisingly, the comparison demonstrates that our method trained on multi-class data reconstructs better on average, presumably because single-class training overfits to the training set. However, our method trained on a single class selects more primitives on average to reconstruct 3D shapes, as shown in Table S1. We believe this is because more primitives are available per class to represent 3D shapes in single-class training, while the network must distribute the limited primitives among all categories in multi-class training.

We also visualize the 3D shapes reconstructed by 3DIAS for both the single-class and the multi-class cases in Figure S2. The results show that the primitives share the same semantic meaning among the 3D shapes in the same category in both cases. Based on our qualitative and quantitative experiments, we believe that the Chamfer distance is not a suitable and reliable metric for the 3D shape reconstruction task. For instance, Chamfer shows better performance for multi-class training on the airplane and rifle categories, while the qualitative results show better appearances for single-class training.
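For reference, the symmetric Chamfer distance questioned above can be sketched as follows (a brute-force O(NM) version; evaluation pipelines typically use accelerated nearest-neighbor search):

```python
import numpy as np

def chamfer_distance(p: np.ndarray, q: np.ndarray) -> float:
    """Symmetric Chamfer distance between point sets p (N, 3) and q (M, 3):
    the mean squared distance from each point to its nearest neighbor in
    the other set, summed over both directions."""
    d2 = ((p[:, None, :] - q[None, :, :]) ** 2).sum(-1)  # (N, M) pairwise
    return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())

p = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
q = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
print(chamfer_distance(p, q))  # (0 + 1)/2 + (0 + 1)/2 = 1.0
```

Because the metric averages nearest-neighbor distances, a prediction can score well while missing thin structures, which is consistent with the mismatch between Chamfer and visual quality noted above.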

[Figure S2: image panels for each category (Airplane, Bench, Cabinet, Car, Chair, Display, Lamp, Speaker, Rifle, Sofa, Table, Phone, Vessel), each with rows RGB Image, Single, and Multi.]

Figure S2: Qualitative comparison of single-class and multi-class cases. We visualize the results for some samples in each category. The first, second, and third lines for each category show the input RGB images, single-class training results, and multi-class training results, respectively.

S.3. Effect of viewpoint

We also show the effect of viewpoint in Figure S3. The results illustrate that the quality of the reconstructed 3D shapes is highly dependent on the point of view when similar shapes are rare in the training dataset (e.g., the bottom airplane). However, we show that this effect decreases when the training dataset contains enough shapes similar to the query shape (e.g., the top airplane).

[Figure S3: panels — RGB Image, Aligned, Top-view, Front-view, Side-view.]

Figure S3: Effect of viewpoint in 3D shape reconstruction. We visualize the reconstructed 3D shape by 3DIAS for two samples. (top) and (bottom) show two airplanes with small and large viewpoint effects on their appearances, respectively.