PolyNet: Polynomial Neural Network for 3D Shape Recognition with PolyShape Representation

Mohsen Yavartanoo¹, Shih-Hsuan Hung², Reyhaneh Neshatavar¹, Yue Zhang², Kyoung Mu Lee¹
¹SNU ECE & ASRI   ²Oregon State University
{myavartanoo,reyhanehneshat,kyoungmu}@snu.ac.kr {hungsh,zhangyue}@oregonstate.edu
Abstract
3D shape representation and its processing have sub-
stantial effects on 3D shape recognition. The polygon
mesh as a 3D shape representation has many advantages
in computer graphics and geometry processing. However,
there are still some challenges for the existing deep neu-
ral network (DNN)-based methods on polygon mesh rep-
resentation, such as handling the variations in the degree
and permutations of the vertices and their pairwise dis-
tances. To overcome these challenges, we propose a DNN-
based method (PolyNet) and a specific polygon mesh rep-
resentation (PolyShape) with a multi-resolution structure.
PolyNet contains two operations: (1) a polynomial convolution
(PolyConv) operation with learnable coefficients,
which learns continuous distributions as the convolutional
filters to share the weights across different vertices, and
(2) a polygonal pooling (PolyPool) procedure by utilizing
the multi-resolution structure of PolyShape to aggregate
the features in a much lower dimension. Our experiments
demonstrate the strength and the advantages of PolyNet on
both 3D shape classification and retrieval tasks compared
to existing polygon mesh-based methods and its superiority
in classifying graph representations of images. The code is
publicly available from this link.
1. Introduction
In recent years, the growing number of applications of 3D shape
representation has made it a fundamental problem in computer
vision, computer graphics, and augmented reality.
The structure and the high-quality appearance of the rep-
resentation significantly impact many tasks, such as 3D
shape classification and retrieval. With the advent of deep
neural network (DNN) architectures, several methods have
been proposed to learn 3D shapes. Generally, these methods
can be categorized into four groups based on the input shape
representation: point clouds [47,48], voxel grids [64,41],
2D projections [56,52,50,9,63], and polygon meshes [20,13,43,40].
Point clouds suffer from harsh noise and discard substantial
structural information of the 3D shapes. Voxel grids require
large memory, and rendering them generates unnecessarily
voluminous data and quantization artifacts. Furthermore, 2D
projection representations encounter severe self-occlusions.
By contrast, a polygon mesh is a collection of vertices
and faces that defines a 3D shape smoothly and entirely.
Therefore, this representation contains structural informa-
tion without any harsh noises, severe artifacts, and self-
occlusions. Additionally, it is a memory-efficient represen-
tation that can store the full geometry details by reducing
unnecessary voluminous data. However, in polygon mesh-
based methods, weight sharing is still a challenging prob-
lem due to variations in the degree of vertices, the permuta-
tion of adjacent vertices, and their pairwise distances.
To overcome the limitations of the polygon mesh-based
methods, in this work, we propose PolyNet, a novel network
that can effectively learn and extract features of a polygon
mesh representation of 3D shapes by a continuous polyno-
mial convolution (PolyConv). PolyConv is a polynomial
function with learnable coefficients which learns continu-
ous distributions as the convolutional filters to share the cor-
responding weights among the features of the vertices in the
local patches made from each vertex and its adjacent vertices
on the surface. This operation is invariant to the number of
adjacent vertices, their permutations, and their pairwise
distances near the central vertex in the local patch.
Moreover, we design PolyShape representation, a specific
polygon mesh representation with a multi-resolution struc-
ture. We utilize this multi-resolution attribute to design our
PolyPool operation and apply it after each PolyConv layer.
This PolyPool operation reduces the mesh resolution by a
fixed factor at each layer. We achieve the best classification
accuracy and mean Average Precision (mAP) compared to
the previous methods based on voxel grid and polygon mesh
and comparable performance to point cloud-based methods.
We also show the superiority of our designed PolyConv on
the challenging 75 Superpixel MNIST dataset. We summa-
rize the main contributions of our method as follows:
• We propose PolyNet, a novel neural network method
  with a continuous convolution operation invariant to
  the number of adjacent vertices, their permutations,
  and their pairwise distances in 3D shapes.
• We employ PolyNet on PolyShape, a polygon mesh
  representation with a multi-resolution structure that
  enables us to use a pooling operation named PolyPool.
• We achieve an improvement in classification and
  retrieval tasks compared to the previous mesh-based
  methods on the ModelNet dataset and the best
  classification performance on the 75 Superpixel MNIST.
2. Related work
In this section, we review the related works based on the
representation of the input 3D shapes: point cloud, voxel
grid, 2D projection, and polygon mesh.
Point cloud. PointNet [47], as a simple and effective DNN-
based method on point clouds, learns the features directly
from each point and aggregates them as one global representation.
However, extracting local structures is important
for the success of convolutional architectures. To overcome
the lack of local structure of this method, PointNet++ [48]
proposes a hierarchical neural network that employs Point-
Net [47] on the group of points divided into overlapping
local patches. Te et al. [57] and Wang et al. [60] utilized the
GraphCNNs to learn the features from a local graph formed
by the connection of the adjacent points. These graph-based
methods produce little shape information since they do not
explicitly represent the local neighboring points in an or-
dered alignment.
Voxel grid. 3D ShapeNets [64] and VoxNet [41] transfer a
3D shape to a structured binary 3D grid called a voxel grid.
Then they learn the global features from the voxels by ex-
tending the CNN architectures from 2D to 3D convolutions.
To reduce the computational complexity on the sparse vox-
els, Riegler et al. [49] and Wang et al. [61] applied the oc-
tree data structure. However, these methods require heavy
computations and unnecessarily voluminous data.
2D projection. MVCNN [56] and RotationNet [24] learn
the features of a 3D shape over multi-view rendered 2D
images based on conventional 2D CNNs. Moreover, pool-
ing operations aggregate these feature values to reduce the
rotation effects of the 3D shape [52,16,56]. However, these
pooling operations lose a lot of geometric details among the
views, such as two surfaces occluding each other.
SeqViews2SeqLabels [19] and SPNet VE [63] aggregate the
information among the sequential views by considering the
view-specific importance to prevent information loss. On
the other hand, DeepPano [52] and PANORAMA-NN [50]
consider a panoramic view of the 3D shape. They project
the shape into a cylinder surrounding it to accumulate the
contents of multiple views altogether. To extend the num-
ber of viewpoints, utilizing a sphere instead of the cylin-
der leads CNNs to cover all views and learn more robust
features consistent with rotations [9,63]. However, these
image-based methods suffer from self-occlusions.
Polygon mesh. A polygon mesh is a discrete representa-
tion of the surface of a 3D shape with faces and vertices.
This representation can be expressed as a graph; accord-
ingly, any graph-based methods can be applied to it. The
existing graph-based methods are classified into two main
categories: spectral methods [8,22,10,26,32] and spatial
methods [42,2,45,17,14,40,43,23]. The convolution
operation in the spectral domain is defined by the eigende-
composition of the graph Laplacian, where the eigenvectors
are the same as the Fourier basis [8]. This process is
basis-dependent, which means that applying the learned parameters
to a new domain produces different features [37]. Moreover,
this operation is non-localized filtering in the spectral
domain [10]. An efficient method to solve the non-localization
problem is approximating the local spectral filters via the
Chebyshev polynomial expansion [10]. On the
other hand, there is no easy way to induce the weight shar-
ing across different locations of the graph due to the dif-
ficulty of matching local neighborhoods in the spatial do-
main [8]. Nevertheless, Atwood and Towsley [2] proposed
a spatial filtering method that assumes information is trans-
ferred from a vertex to its adjacent vertex with a specific
transition probability. The power of the transition probabil-
ity matrix implies that farther adjacent vertices provide little
information for the central vertex. Furthermore, Geodesic
CNN [40], MoNet [43], and SplineCNN [14] deal with the
weight sharing problem by designing local coordinate sys-
tems for the central vertex in a local patch. They apply a
set of weighting functions to aggregate features on adja-
cent vertices. Then they compute a learnable weighted aver-
age of these aggregated features as the spatial convolution.
However, these methods are computationally expensive and
require predefined local systems of coordinates. Moreover,
Neural3DMM [5] introduces the spiral convolution opera-
tion by enforcing a local ordering of vertices through the
spiral operator. An initial point for each spiral is a vertex
with the shortest geodesic path to a fixed reference point
on a template shape. The remaining vertices of the spiral
are ordered in the clockwise or counterclockwise directions
inductively. However, finding a reference point for an arbi-
trary shape is challenging. Moreover, the initial point is not
unique once two or more adjacent vertices have the same
shortest path to the reference point.
[Figure 1 diagram. Block labels: PolyShape; Poly-Conv+InstanceNorm+Tanh;
Poly-Pool (max); Global Avg-Pool; FC+BatchNorm+ReLU+Dropout;
FC+BatchNorm+LogSoftmax. Layer widths shown: 64, 128, 256, 512, 1024,
1024, #classes.]
Figure 1: The overview of PolyNet architecture. PolyNet takes a PolyShape as an input and applies four PolyConv layers
followed by instance normalization and pooling layers and three fully connected (FC) layers followed by batch normalization
layers to learn the local and the global features of the shape. Then, we employ the hyperbolic tangent and ReLU activation
functions to empower PolyConv and FC layers, respectively. Moreover, PolyPool layers reduce the spatial dimensions and
minimize the overfitting by utilizing the multi-resolution structure of the PolyShape. The global average pooling layer avoids
the permutation ambiguities of the vertices. Note that V and F refer to the number of vertices and faces for each shape, respectively.
3. PolyNet
In this section, we explain the details of our PolyNet
architecture and its invariance to the number of adjacent
vertices, their permutations, and their pairwise distances in 3D
shapes. PolyNet learns the features locally by the PolyConv
operation and performs the PolyPool procedure by utilizing
the multi-resolution structure of our designed PolyShape
representation. Figure 1 shows the overview of our PolyNet
architecture. A 3D shape with PolyShape representation
passes through a straightforward network with three Poly-
Conv layers followed by PolyPool layers and another Poly-
Conv layer with a global average pooling to learn and ex-
tract the features. Then three fully connected layers classify
the shape with these extracted features.
3.1. Polynomial convolution operation
To overcome the challenges of weight sharing across
different vertices in conventional CNNs and GraphCNNs,
we propose the PolyConv operation, which learns a probability
density function (PDF) as a convolutional filter. Let
us assume that the surface of a 3D shape is a differential
manifold $\mathcal{M}$. For a point $v$ and its neighbor $u$ in a
local patch $N(v)$ on the manifold $\mathcal{M}$, we define signals
$x: \mathcal{M} \to [-1,1]$ and $y: \mathcal{M} \to [-1,1]$ as the
features at those points, respectively. Without loss of generality,
we consider the convolutional weights of the standard
CNN as probability distributions. We then argue that
a patch operation $D(v)$ in the standard CNN can be expressed
as an expected value over the features on sample
points $N_s(v) \subset N(v)$ surrounding $v$ as Eq. 1:
$$D(v) = E[y \mid x] = \sum_{u \in N_s(v)} w(u)\, y(u), \qquad (1)$$
where $w(u)$ is the probability (weight) corresponding to the
point $u$. However, in a general graph or polygon mesh, the
locations of the adjacent points are in a continuous domain
and can vary; hence, it is not possible to assign a discrete
distribution as the weights. Therefore, we assume that there
is an unknown conditional PDF that can express the convo-
lution filter weights, and then we can formulate the expected
value over each patch as Eq. 2:
$$D(v) = E[y \mid x] = \int_{N(v)} y\, f(y \mid x)\, dy, \qquad (2)$$
where $f(y \mid x)$ is the conditional probability of the feature $y$
on point $u$ in the neighborhood of the central point $v$ with
the given feature $x$. The conditional probability $f(y \mid x)$ can
be written as Eq. 3:
$$f(y \mid x) = \frac{f(x, y)}{f_x(x)} = \frac{f(x, y)}{\int_{-1}^{1} f(x, y)\, dy}, \qquad (3)$$
where $f_x(x)$ is the marginal distribution that can be obtained
by integrating the joint probability distribution,
$f(x, y)$, over $y$. Note that, since the value of feature $y$ is
defined in $[-1,1] \subset \mathbb{R}$, $\int_{-1}^{1} f(x, y)\, dy$ is a
definite integral on the interval $[-1,1] \subset \mathbb{R}$. We
reformulate $f(y \mid x)$ by approximating $f(x, y)$ with a polynomial
function of $x$ and $y$ by considering a certain degree $d$ as Eq. 4:
$$f(y \mid x) = \frac{\sum_{0 \le i,j,\; i+j \le d} a_{i,j}\, x^i y^j}{\sum_{0 \le i \le d} b_i\, x^i}, \qquad (4)$$
where the coefficients $b_i$ can be directly obtained by computing
the marginal distribution $f_x(x)$ from $f(x, y)$. To ensure
that the polynomial function as a PDF is always positive,
the coefficient matrix $A$ in the compact form of the
polynomial function as Eq. 5 must be positive definite.
$$f(x, y) = \sum_{0 \le i,j,\; i+j \le d} a_{i,j}\, x^i y^j = X^T A X > 0, \qquad (5)$$
where $X$ is the vector of monomials of the variables $x$ and $y$
with degree less than or equal to $d/2$. Therefore, instead of
learning the coefficient matrix $A$ directly, we parameterize it as
$A = BB^T \succeq 0$. Indeed, we approximate the conditional PDF
$f(y \mid x)$ as the continuous convolutional filters with the
polynomial functions, which are parameterized by the learnable
symmetric matrix $B$. For more details, refer to the supplementary
material. The large degree of freedom in polynomial functions
allows approximating any complex distribution.

Figure 2: PolyConv operations over a local patch: (a) unsqueezed operations, (b) squeezed operations. [Panel labels in both
diagrams: Input Local Patch, Input Features, Convolution Weights, Features at each channel, Expected Values, Output
Features.] A set of conditional PDFs approximated with polynomial functions with learnable coefficients is applied as the
convolutional filters to learn the input features on a local patch of vertices $N_s(v) \subset N(v)$. (a) The unsqueezed operation
includes $C \times C'$ conditional PDFs as the convolutional filters, which map the input features to the higher dimensional
output features. (b) The squeezed operation contains only $C$ conditional PDFs, which are combined with a fully connected
layer to map the input features to the higher dimensional output features. Note that $C$ and $C'$ refer to the number of input
channels and output channels, respectively.
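
The construction in Eqs. 3 to 5 can be illustrated with a small sketch for degree d = 2, where X = [1, x, y] and A = BB^T is a 3×3 matrix. This is only an illustration under stated assumptions, not the authors' implementation: the function names and the hand-picked symmetric matrix B are hypothetical, and B stands in for a learned parameter.

```python
# Illustrative sketch for d = 2 (not the authors' code).
# X = [1, x, y]; A = B B^T is positive semi-definite by construction.

# Exponents (power of x, power of y) of the entries of X.
MONOMIALS = [(0, 0), (1, 0), (0, 1)]


def gram(B):
    """Return A = B B^T for a square matrix B given as nested lists."""
    n = len(B)
    return [[sum(B[i][k] * B[j][k] for k in range(n)) for j in range(n)]
            for i in range(n)]


def joint(A, x, y):
    """Joint polynomial f(x, y) = X^T A X (Eq. 5)."""
    X = [x ** px * y ** py for px, py in MONOMIALS]
    return sum(A[i][j] * X[i] * X[j] for i in range(3) for j in range(3))


def marginal(A, x):
    """Marginal f_x(x): integrate f(x, y) over y in [-1, 1] term by term."""
    total = 0.0
    for i, (pxi, pyi) in enumerate(MONOMIALS):
        for j, (pxj, pyj) in enumerate(MONOMIALS):
            q = pyi + pyj  # power of y in this term
            # Integral of y^q over [-1, 1]: zero for odd q, 2 / (q + 1) otherwise.
            integral_y = 0.0 if q % 2 else 2.0 / (q + 1)
            total += A[i][j] * x ** (pxi + pxj) * integral_y
    return total


def conditional(A, x, y):
    """Conditional PDF f(y | x) = f(x, y) / f_x(x) (Eq. 3)."""
    return joint(A, x, y) / marginal(A, x)


# Hypothetical symmetric, non-singular B (six free entries for d = 2).
B = [[1.0, 0.3, -0.2],
     [0.3, 0.8, 0.4],
     [-0.2, 0.4, 0.6]]
A = gram(B)

if __name__ == "__main__":
    print(conditional(A, 0.25, -0.5))
```

Because the denominator is the exact integral of the numerator over y, the conditional integrates to one over [-1, 1] for every fixed x, which is the property the filter relies on.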
It is important to note that since we only have a few samples
$N_s$ (e.g., points, vertices, etc.) in each local patch $N$
on the manifold $\mathcal{M}$, computing the exact expected value
over each local patch is not possible with Eq. 2. Therefore,
we approximate the integral by taking the weighted average
over these sample points as Eq. 6:
$$\int_{N(v)} y\, f(y \mid x)\, dy \simeq \frac{1}{|N_s(v)|} \sum_{u \in N_s(v)} y\, f(y \mid x). \qquad (6)$$
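
The sample average in Eq. 6 can be sketched as follows. The names `patch_operator` and `toy_cond_pdf` are hypothetical, and the toy density f(y|x) = (1 + xy)/2 is a hand-picked polynomial that is a valid conditional PDF on [-1, 1]; it merely stands in for a learned filter.

```python
# Illustrative sketch of the sampled patch operator in Eq. 6 (not the authors' code).


def toy_cond_pdf(y, x):
    """Hypothetical conditional PDF on [-1, 1]: non-negative and integrates to 1 over y."""
    return 0.5 * (1.0 + x * y)


def patch_operator(x_center, neighbor_feats, cond_pdf):
    """Approximate D(v) by averaging y * f(y | x) over the sampled neighbors."""
    if not neighbor_feats:
        raise ValueError("a patch needs at least one neighbor sample")
    weighted = sum(y * cond_pdf(y, x_center) for y in neighbor_feats)
    return weighted / len(neighbor_feats)


if __name__ == "__main__":
    # The same call works for any number of neighbors, in any order.
    print(patch_operator(0.5, [0.2, -0.4, 0.9], toy_cond_pdf))
    print(patch_operator(0.5, [0.9, 0.2, -0.4, 0.1, -0.3], toy_cond_pdf))
```

Since the result is a mean over the neighbor set, it does not depend on the neighbor order and is defined for any vertex degree, which is how the operator handles varying degree and permutation.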
Finally, we design unsqueezed and squeezed operations
based on the proposed patch operator as the convolutions to
learn from the input features, as shown in Figure 2. In the
first approach, similar to the conventional CNNs, we con-
sider multiple conditional PDFs corresponding to the input
and the output channels as the convolution filters. How-
ever, this operation requires heavy computations and large
memory usage. Therefore, we squeeze it by allocating dif-
ferent conditional PDFs to only the input channels and ag-
gregate the results by a fully connected layer. The second
approach is beneficial when the number of input vertices is
large. Therefore, with this continuous convolution opera-
tion, we can locally learn the features from the surface of
3D shapes, which is invariant to the number of vertices in a
local patch, their permutation, and their pairwise distances.
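
To make the cost difference between the two variants concrete, the following sketch counts learnable parameters for one layer. It assumes that each PDF is described by the free entries of the symmetric matrix in Eq. 5 (which gives the six and 21 coefficients quoted for d = 2 and d = 4 in Section 4.2) and that the fully connected layer of the squeezed variant has no bias; the function names and channel sizes are hypothetical.

```python
# Illustrative parameter counts for one PolyConv layer (assumptions noted above).


def num_poly_coeffs(d):
    """Free entries of the symmetric coefficient matrix for an even degree d."""
    k = d // 2
    m = (k + 1) * (k + 2) // 2  # monomials in x and y of degree <= d / 2
    return m * (m + 1) // 2      # upper triangle of an m x m symmetric matrix


def unsqueezed_params(c_in, c_out, d):
    """One conditional PDF per (input channel, output channel) pair."""
    return c_in * c_out * num_poly_coeffs(d)


def squeezed_params(c_in, c_out, d):
    """One conditional PDF per input channel plus a bias-free C -> C' linear map."""
    return c_in * num_poly_coeffs(d) + c_in * c_out


if __name__ == "__main__":
    print(num_poly_coeffs(2), num_poly_coeffs(4))  # 6 21
    print(unsqueezed_params(64, 128, 2))           # 49152
    print(squeezed_params(64, 128, 2))             # 8576
```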
3.2. Polygonal shape representation and pooling
To apply pooling after each convolution operation, we
present PolyShape representation with a multi-resolution
structure made of a sequence of the subdivisions and shape
fittings, as shown in Figure 3. This multi-resolution struc-
ture enables the pooling operations without any learnable
parameter, which is similar to the multi-level pooling on
images. Moreover, PolyShape maintains the structural de-
tails and the topology of the shape after each pooling and
provides a semi-regular structure that benefits the analysis
of the local structure of the shape (i.e., each vertex and its
corresponding neighborhood) [46].
For PolyShape processing, we first apply mesh fusion [55]
to a given 3D CAD model to abstract the shape with a simpler
topology. Next, we fix the geometric errors of the meshes,
such as the non-manifold edges and double vertices, and then
reduce the number of vertices to obtain a coarse mesh with
nearly 400 vertices. Lastly, we subdivide the coarse mesh and
fit the resulting mesh to the given model to restore the details
of the original shape. We apply this subdivision routine
iteratively, as many times as the number of pooling layers in
PolyNet (i.e., 3 times). For the subdivision, there are two
common methods: the primal triangle quadrisection (PTQ) [39]
and the √3-subdivision [28]. PTQ is a straightforward approach
that splits a triangle into four sub-triangles. It creates new
vertices on each edge in the original mesh and connects
them to each of the other new vertices from the same face.
The other strategy, √3-subdivision, adds the new vertices
inside each triangle in the original mesh and connects the
new vertices to each of its three old surrounding vertices
and adjacent new vertices. Every two iterations of the
√3-subdivision separate each original triangle into nine
sub-triangles. Thus, PolyShapes have fewer triangles by the
√3-subdivision than the PTQ. We evaluate the effectiveness of
the PolyShapes made by both subdivisions in Section 4.

[Figure 3 diagram. Stages: 3D CAD Model, Coarse, Subdivision Level 1, Fitting Level 1, Subdivision Level 2,
Fitting Level 2, Subdivision Level 3 (PolyShape). Legend: vertices of 3D CAD, Coarse, Level 1, Level 2, and PolyShape;
steps: Preprocessing, Subdivision, Shape Fitting.]
Figure 3: The overview of PolyShape processing. A given 3D CAD model passes through a preprocessing pipeline to produce
a coarse polygon mesh with a simpler topology and a single connected component. Next, the subdivision and shape fitting
procedures sequentially create PolyShape with the multi-resolution structure for the shape.
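
The growth of the two subdivision schemes can be checked with simple counting. The sketch below assumes a closed, genus-0 triangle mesh with exactly 400 vertices as the coarse level (the text only says "nearly 400"), so the numbers are indicative; the function names are hypothetical.

```python
# Illustrative vertex/edge/face counts under repeated subdivision.
# Assumption: closed genus-0 triangle mesh, so E = 3V - 6 and F = 2V - 4.


def closed_genus0_counts(v):
    """Return (V, E, F) for a closed genus-0 triangle mesh with v vertices."""
    return v, 3 * v - 6, 2 * v - 4


def ptq_step(v, e, f):
    """PTQ: one new vertex per edge; every triangle becomes four."""
    return v + e, 2 * e + 3 * f, 4 * f


def sqrt3_step(v, e, f):
    """sqrt(3)-subdivision: one new vertex per face, old edges flipped; every triangle becomes three."""
    return v + f, e + 3 * f, 3 * f


def subdivide(counts, step, levels):
    """Apply a subdivision step `levels` times to a (V, E, F) tuple."""
    for _ in range(levels):
        counts = step(*counts)
    return counts


if __name__ == "__main__":
    coarse = closed_genus0_counts(400)
    print(subdivide(coarse, ptq_step, 3))    # (25474, 76416, 50944)
    print(subdivide(coarse, sqrt3_step, 3))  # (10748, 32238, 21492)
```

Under these assumptions, three levels give about 25.5k vertices for PTQ and about 10.7k for the √3-subdivision, close to the average vertex counts reported in Table 3.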
With the multi-resolution structure of PolyShape, we can
downsample the output of PolyConv layers by collapsing
the neighboring vertices to each interior vertex (i.e., the
vertices of the coarser meshes), as shown in Figure 4. The PTQ
and √3-subdivision upsample the mesh vertices by adding
another vertex at the center of each edge and each triangle,
respectively. The downsampling procedures are accomplished
as the inverse process of the upsampling methods,
which allows us to generate relatively larger polygons, as
shown in Figure 4. Therefore, we can reduce the number
of polygons by factors of four and three by employing
the PTQ and √3-subdivision methods, respectively. We use
these downsampling procedures as the pooling, which facilitates
aggregating the features, where each vertex $v$ on the
downsampled mesh takes the maximum (max-pool) over the
feature values of the vertices $v \cup \{u_i\}_{i=1}^{m}$ in a local patch
on the output of PolyConv layers. Therefore, PolyPool enables
describing a 3D shape with a large number of vertices
by aggregating the features. These aggregated features are
much lower in dimension compared to using all of the extracted
features and can also improve the performance, like
the conventional pooling operation of the 2D CNNs [44].

Figure 4: PolyPool operations: (a) original, (b) PTQ, (c) √3-subdivision. The black dots and red dots
show the vertices before and after PolyPool, respectively.
Dashed lines indicate the omitted edges after the subdivisions.
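
The max-pooling step described above can be sketched as follows. The data layout is an assumption made for illustration: each retained (coarse) vertex is given as a pair of its index in the fine mesh and the indices of the fine vertices collapsed into it; the name `poly_pool_max` is hypothetical.

```python
# Illustrative max-pooling over local patches (not the authors' code).


def poly_pool_max(features, patches):
    """Pool per-vertex features from a fine mesh onto the retained coarse vertices.

    features: list of per-vertex feature vectors on the fine mesh.
    patches:  list of (center_index, neighbor_indices), one entry per coarse vertex.
    Returns one feature vector per coarse vertex, the channel-wise maximum
    over the center vertex and its collapsed neighbors.
    """
    pooled = []
    for center, neighbors in patches:
        members = [center] + list(neighbors)
        channels = len(features[center])
        pooled.append([max(features[i][c] for i in members) for c in range(channels)])
    return pooled


if __name__ == "__main__":
    fine_features = [[0.1, 0.9], [0.5, -0.2], [0.3, 0.4], [-0.7, 0.0]]
    patches = [(0, [1, 2]), (3, [2])]
    print(poly_pool_max(fine_features, patches))  # [[0.5, 0.9], [0.3, 0.4]]
```

Because the patches come from the stored subdivision hierarchy, this step needs no learnable parameters and no clustering at run time.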
4. Experiments
In this section, we present the details of the datasets and
several experiments of PolyNet on the 3D shapes with and
without PolyShape representation and graph representation
of images. We compare our proposed method with the state-
of-the-art methods on both classification and retrieval tasks.
We use the Adam optimizer in all of our experiments with
initial learning rates of 1e-3 and 1e-2 for the unsqueezed and
squeezed cases, respectively. We set the mini-batch size to
100 and 10 for experiments on 3D shapes and graph
representation of images, respectively. We choose the hyperbolic
tangent as the activation function on PolyConv layers to
guarantee that the input features to the next layers are in the
interval $[-1,1] \subset \mathbb{R}$, and we use the cross-entropy loss
between the model's predicted scores and the ground truth labels.
We implement our model in Python 3.6 using PyTorch with
CUDA. PolyShape processing, including both
subdivision methods, takes 92 ms for one CAD model on
average, and we ensure all the conversions are successful
for the ModelNet dataset. The average testing times for the
baseline of PolyNet per shape for the √3-subdivision and
PTQ are 13 ms and 18 ms, respectively. We will publish the
code for both PolyShape processing and PolyNet.

Figure 5: Joint and marginal distributions. Visualization of (a) the learned joint distributions $f(x, y)$ and (b) the marginal
distributions $f_x(x)$ approximated by polynomial functions of degree $d = 2$ on the ModelNet-10 with the √3-subdivision.
4.1. Datasets
In our experiments, we use both the ModelNet-10 and
the ModelNet-40 datasets [64], containing 4,899 CAD models
(3,991 for training and 908 for testing) in 10 categories
and 12,311 CAD models (9,843 for training and
2,468 for testing) in 40 categories, respectively. We apply
our PolyShape processing to the CAD models with Houdini [53],
a popular 3D modeling software. For more details
about the PolyShape pipeline, refer to the supplementary
material. Additionally, we translate and scale the resulting
PolyShapes into the bounding box $[-1,1]^3 \subset \mathbb{R}^3$. We
extract the coordinates $(x, y, z) \in \mathbb{R}^3$ and the normal
vectors $(n_x, n_y, n_z) \in \mathbb{R}^3$ for all vertices as the first
input into PolyNet. We also use the MNIST dataset [31], which
includes 28×28 images. These images are represented as different
graphs so that each vertex and each edge corresponds
to a superpixel and the spatial relation between two superpixels,
respectively [43]. Therefore, we consider the construction
of superpixel-based graphs with 75 vertices. We
use the standard splitting of the MNIST dataset, including
60k and 10k images for training and testing, respectively.
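
The translation and scaling of a shape into the box [-1, 1]^3 can be sketched as below. The text does not say whether the scaling is uniform; this sketch assumes a single scale factor for all axes so that the aspect ratio is preserved, and the function name is hypothetical.

```python
# Illustrative normalization into [-1, 1]^3 (uniform scaling is an assumption).


def normalize_to_unit_box(vertices):
    """Center the bounding box at the origin and scale its longest side to length 2."""
    mins = [min(v[k] for v in vertices) for k in range(3)]
    maxs = [max(v[k] for v in vertices) for k in range(3)]
    center = [(lo + hi) / 2.0 for lo, hi in zip(mins, maxs)]
    half_extent = max((hi - lo) / 2.0 for lo, hi in zip(mins, maxs))
    if half_extent == 0:
        raise ValueError("degenerate shape: all vertices coincide")
    return [[(v[k] - center[k]) / half_extent for k in range(3)] for v in vertices]


if __name__ == "__main__":
    shape = [(0.0, 0.0, 0.0), (2.0, 4.0, 1.0), (1.0, 2.0, 0.5)]
    print(normalize_to_unit_box(shape))
    # [[-0.5, -1.0, -0.25], [0.5, 1.0, 0.25], [0.0, 0.0, 0.0]]
```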
4.2. Convolution operation
We evaluate our convolution operation PolyConv with
different configurations and compare it with various well-known
convolutional operations, as shown in Table 1. Since
squeezed PolyConv requires fewer computations and less
memory than the unsqueezed version because it has fewer
learnable parameters, we use it for the experiments of 3D
shape classification, where the inputs are extremely large
(roughly 10k vertices). We consider two different degrees,
$d = 2$ and $d = 4$, for each polynomial function defined in
Eq. 5, which require six and 21 learnable coefficients for a
patch operation, respectively. The results on ModelNet-10 with
both subdivision strategies show that PolyConv with degree
$d = 2$ achieves relatively higher performance than degree
$d = 4$, which may be due to its simpler and easier-to-learn
structure for approximating the distributions. Furthermore, we
evaluate PolyNet on ModelNet-10 by replacing PolyConv with
well-known convolutions such as ChebConv [11], GCNConv [27],
GMMConv [43], SplineConv [14], XConv [36], and FiLMConv [7].
We show that PolyConv with degree $d = 2$ achieves superior
performance compared to all mentioned convolutions for both
subdivision strategies. Toward a better understanding of the
PDFs, we visualize the learned joint PDFs $f(x, y)$ and the
marginal PDFs $f_x(x)$ of squeezed PolyConv for polynomial
functions of degree $d = 2$, which are learned on the ModelNet-10
dataset with the √3-subdivision, in Figure 5. The results
illustrate the diversity of learned PDFs among different input
channels and different layers of PolyNet.

Conv.              √3-subdivision            PTQ                       Params.
                   max     avg     Time      max     avg     Time
XConv [36]         84.58   83.31   173ms     85.54   83.85   835ms     473k
SplineConv [14]    93.46   92.72   12ms      93.16   92.66   18ms      111m
ChebConv [11]      93.95   93.42   12ms      93.70   93.11   14ms      712k
GCNConv [27]       93.85   93.38   9ms       93.85   93.30   13ms      179k
GMMConv [43]       93.32   92.68   17ms      93.20   92.36   28ms      4.6m
FiLMConv [7]       94.43   93.89   9ms       94.30   93.76   13ms      1.1m
PolyConv (d = 4)   93.96   93.13   25ms      94.00   92.38   38ms      189k
PolyConv (d = 2)   94.52   93.95   13ms      94.40   94.08   18ms      182k
Table 1: Classification accuracy (Acc%), testing time, and
number of parameters in only convolution layers on the
ModelNet-10 with both subdivision strategies for the various
convolution operations in PolyNet.

Figure 6: The graphs of the MNIST dataset. Visualization
of handwritten digits (0 to 5) in the MNIST dataset, individually
represented as the graph of superpixels with 75 vertices.
On the other hand, we evaluate our unsqueezed Poly-
Conv and compare it with the squeezed PolyConv for the
polynomial functions of degree d= 2 on a classical task
of handwritten digit classification in the graph represen-
tation of the MNIST dataset [43] with 75 vertices. De-
spite the simplicity of underlying images, this is a challeng-
ing task due to the lack of a regular grid structure among
the nodes, as shown in Figure 6. We use three convolu-
tional layers (256,256,256) and three fully connected lay-
ers (1024,1024,10) in the network architecture. We employ
both unsqueezed and squeezed PolyConvs as the convolu-
tion operations and the graclus clustering for the pooling
procedure. We use the position information of the vertices
as the extra features. We show in Table 2 that PolyConv
outperforms the existing methods, which demonstrates the
strengths of PolyConv to learn features from irregular data
as well as semi-regular data. Moreover, we show that while
the squeezed version of PolyConv achieves only slightly
lower performance, it requires 135k learnable parameters
in PolyConv layers, which is much more efficient than un-
squeezed PolyConv with 792k parameters. Note that the
average testing time for squeezed PolyConv on the samples
of the 75 Superpixel MNIST is 1.4 ms, while it is 12.4 ms
for the unsqueezed PolyConv.
4.3. Pooling layers
To show the benefits of our PolyPool operation, we ap-
ply PolyNet with various configurations of the pooling op-
eration and input data type and compare the results in Ta-
ble 3. We consider two different data representations, in-
cluding data with and without PolyShape processing for the
ModelNet-10 dataset. Our experiments demonstrate that
PolyPool with PolyShape representation can effectively im-
prove the performance, especially by the pooling based on
√3-subdivision. Moreover, we employ a three-level graclus [12]
as an efficient clustering algorithm on the data
without PolyShape representation. However, the results
show lower accuracy when we use pooling based on the graclus
clustering compared to both √3-subdivision and PTQ.
We interpret the accuracy gap as the effect of losing struc-
tural information of 3D shapes by applying the graclus clus-
tering. Note that we compute the maximum value (max-
pool) of each local patch for all pooling strategies.
4.4. 3D shape classification
To improve the 3D shape classification performance, we
combine the output of the last PolyConv layer for both
subdivisions by taking an average over their features. We
compare the classification results of our PolyNet with the
recent state-of-the-art methods in Table 4 on the ModelNet-10
and ModelNet-40 datasets. We note that our PolyNet out-
performs all the mesh-based and voxel-based approaches
on the classification task and achieves comparable perfor-
mance to the methods based on point clouds. The perfor-
mance gaps between the 2D projection-based methods and
other methods are due to utilizing pre-trained networks on
a large number of images. Moreover, images include tex-
ture information produced by lights and shadows, while the
other representations suffer from a lack of such information.
Method                    Acc. (max)
MoNet [43]                91.11
SplineCNN [14]            95.22
GCGP [59]                 95.80
GAT [4]                   96.19
PNCNN [15]                98.76
PolyConv (squeezed)       98.39
PolyConv (unsqueezed)     98.95
Table 2: Classification accuracy (Acc%) of various methods
and our PolyConv operations on a superpixel representation
of the MNIST dataset with 75 vertices.
Pooling              PolyShape   Accuracy          Time    Num.
                                 max     avg
No pooling           ✗           94.11   92.69     22ms    2.8k
Graclus [12]         ✗           94.14   92.73     16ms    2.8k
PolyPool (PTQ)       ✓           94.40   94.08     18ms    25.7k
PolyPool (√3-sub)    ✓           94.52   93.95     13ms    10.8k
Table 3: Classification accuracy (Acc%), average testing
time, and average number of vertices for various types of
poolings and data with and without PolyShape processing.
Rep.            Method                      ModelNet-10       ModelNet-40
                                            Acc     mAP       Acc     mAP
2D Projection   DeepPano [52]               85.45   84.18     77.63   76.81
                MVCNN [56]                  -       -         90.10   79.50
                PANORAMA-ENN [50]           96.85   93.28     95.56   86.34
                SPNet VE [63]               97.25   94.20     92.63   85.21
                RotationNet [24]            98.46   -         97.37   -
Voxel grid      3D ShapeNets [64]           83.54   68.26     77.32   49.23
                VoxNet [41]                 92.00   -         83.00   -
                VRN [6]                     93.61   -         91.33   -
                FusionNet [21]              93.11   -         90.80   -
                LP-3DCNN [29]               94.40   -         92.10   -
Point cloud     PointNet [47]               -       -         89.20   -
                PointNet++ [48]             -       -         91.90   -
                SO-Net [33]                 95.50   -         90.80   -
                KCNet [51]                  94.40   -         91.00   -
                PCNN [3]                    94.90   -         92.30   -
                SpiderCNN [62]              -       -         92.40   -
                PointCNN [35]               -       -         92.50   -
                DGCNN [1]                   -       -         92.90   -
                KPConv [58]                 -       -         92.90   -
                RS-CNN [38]                 -       -         93.60   -
Polygon mesh    SPH [25]                    79.79   44.05     68.23   33.26
                Geometry Image [54]         88.40   74.90     83.90   51.30
                MeshNet [13]                -       -         91.90   81.90
                Cross-atlas [34]            91.20   -         87.50   -
                SNGC [18]                   -       -         91.60   -
                MeshWalker [30]             -       -         92.30   -
                PolyNet (√3), (d=2)         94.52   83.91     92.14   82.36
                PolyNet (PTQ), (d=2)        94.40   83.84     92.06   81.91
                PolyNet (PTQ, √3), (d=2)    94.93   84.62     92.42   82.86
Table 4: Classification accuracy (Acc%) and mean Average
Precision (mAP%) of PolyNet compared to the state-of-the-art
methods based on different representations on the
ModelNet-10 and the ModelNet-40 datasets.
4.5. 3D shape retrieval
We also evaluate PolyNet on the retrieval task and compare it with
previous methods. We extract the output after the softmax, measure the
similarity between the query and the retrieved shapes by the L1 norm,
and rank the relevant shapes. We use mAP to quantitatively compare our
retrieval approach to the related methods on both the ModelNet-10 and
ModelNet-40 datasets, as shown in Table 4. We outperform all previously
evaluated methods on the retrieval task that are based on polygon mesh
and voxel grid representations. Lastly, Figure 7 shows some retrieved
shapes in ranked order for the given queries on ModelNet-10, using a
network trained with squeezed PolyConvs and polynomial functions of
degree d = 2. Our method retrieves visually similar shapes even when the
query and the retrieved shapes are in different categories (e.g., a
table retrieved for the query desk and a nightstand for the dresser).
Figure 7: Retrieval results. This figure shows the shapes retrieved by
PolyNet for the given queries. The blue models in the first column are
the queries. The retrieved results in green are from the same category
as the query, while the results in red are from different categories.
From left to right, the results are ordered by descending rank.
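The ranking step described in this section can be illustrated with a minimal sketch. It assumes that a smaller L1 distance between softmax outputs means a more similar shape, and it computes the average precision (AP) of a single query; mAP would then be the mean of AP over all queries. The function names, the toy softmax vectors, and the labels are hypothetical and are not taken from the released code.

```python
def l1_distance(p, q):
    """L1 norm of the difference between two softmax output vectors."""
    return sum(abs(a - b) for a, b in zip(p, q))


def rank_gallery(query, gallery):
    """Return gallery indices sorted by increasing L1 distance to the query."""
    return sorted(range(len(gallery)), key=lambda i: l1_distance(query, gallery[i]))


def average_precision(ranked_labels, query_label):
    """AP of one ranked list: mean precision at the rank of each relevant item."""
    hits, precisions = 0, []
    for rank, label in enumerate(ranked_labels, start=1):
        if label == query_label:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / hits if hits else 0.0


# Toy softmax outputs over three hypothetical classes (assumed values).
gallery = [
    [0.90, 0.05, 0.05],  # class 0
    [0.10, 0.80, 0.10],  # class 1
    [0.60, 0.30, 0.10],  # class 0
    [0.20, 0.20, 0.60],  # class 2
    [0.15, 0.70, 0.15],  # class 0, but predicted mostly as class 1
]
gallery_labels = [0, 1, 0, 2, 0]
query, query_label = [0.70, 0.20, 0.10], 0

order = rank_gallery(query, gallery)
ap = average_precision([gallery_labels[i] for i in order], query_label)
print(order, round(ap, 4))
```

In this toy example, the class-0 shape whose softmax output is dominated by another class is ranked fourth, which lowers the AP below 1. This mirrors how shapes from visually similar categories can be mixed in the ranked list.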
5. Conclusion
In this paper, we propose PolyNet, a DNN-based method that consists of
PolyConv and PolyPool operations to locally learn and aggregate the
information on the surface of 3D shapes. In PolyConv, we utilize
polynomial functions to learn a continuous distribution as the
convolutional filters, which is invariant to variations in the degree
of the vertices, their permutations, and their pairwise distances.
Moreover, we design PolyShape with a multi-resolution structure that
enables applying the PolyPool operation without losing geometrical
structures after each layer. Our comprehensive evaluations of PolyNet
on classification and retrieval tasks, together with the theoretical
analysis of the invariance properties of PolyConv, demonstrate its
strength and its superiority over most of the previous methods. In
future work, we will explore the applications of PolyNet to 3D shape
segmentation and of PolyConv to image-based computer vision tasks where
there is no regular neighborhood connectivity.
Acknowledgement
This work was supported by an IITP grant funded by the Korea government
(MSIT) [No. 2021-0-01343, Artificial Intelligence Graduate School
Program (Seoul National University)].
References
[1] Dgcnn: A convolutional neural network over large-scale la-
beled graphs. Neural Networks, 2018. 8
[2] James Atwood and Donald F. Towsley. Diffusion-
convolutional neural networks. In NIPS, 2016. 2
[3] Matan Atzmon, Haggai Maron, and Yaron Lipman. Point
convolutional neural networks by extension operators. ACM
Trans. Graph., 2018. 8
[4] P. H. C. Avelar, A. R. Tavares, T. L. T. da Silveira, C. R. Jung,
and L. C. Lamb. Superpixel image classification with graph
attention networks. In 2020 33rd SIBGRAPI Conference on
Graphics, Patterns and Images (SIBGRAPI), 2020. 7
[5] Giorgos Bouritsas, Sergiy V. Bokhnyak, Stylianos Ploumpis,
Michael M. Bronstein, and Stefanos Zafeiriou. Neural 3d
morphable models: Spiral convolutional networks for 3d
shape representation learning and generation. ICCV, 2019. 2
[6] André Brock, Theodore Lim, James M. Ritchie, and Nick
Weston. Generative and discriminative voxel modeling with
convolutional neural networks. CoRR, 2016. 8
[7] Marc Brockschmidt. GNN-FiLM: Graph neural networks
with feature-wise linear modulation. In Proceedings of the
37th International Conference on Machine Learning, 2020.
6
[8] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann Le-
Cun. Spectral networks and locally connected networks on
graphs. CoRR, 2013. 2
[9] Taco S. Cohen, Mario Geiger, Jonas Köhler, and Max
Welling. Spherical cnns. CoRR, 2018. 1,2
[10] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst.
Convolutional neural networks on graphs with fast localized
spectral filtering. In NIPS, 2016. 2
[11] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst.
Convolutional neural networks on graphs with fast localized
spectral filtering. In Proceedings of the 30th International
Conference on Neural Information Processing Systems, 2016. 6
[12] I. S. Dhillon, Y. Guan, and B. Kulis. Weighted graph cuts
without eigenvectors a multilevel approach. IEEE Transac-
tions on Pattern Analysis and Machine Intelligence, 2007. 7
[13] Yutong Feng, Yifan Feng, Haoxuan You, Xibin Zhao, and
Yue Gao. Meshnet: Mesh neural network for 3d shape rep-
resentation. In AAAI, 2018. 1,8
[14] M. Fey, J. E. Lenssen, F. Weichert, and H. Müller. Splinecnn:
Fast geometric deep learning with continuous b-spline kernels.
2018 IEEE/CVF Conference on Computer Vision and
Pattern Recognition, 2018. 2,6,7
[15] Marc Finzi, R. Bondesan, and M. Welling. Proba-
bilistic numeric convolutional neural networks. ArXiv,
abs/2010.10876, 2020. 7
[16] Takahiko Furuya and Ryutarou Ohbuchi. Deep semantic
hashing of 3d geometric features for efficient 3d model re-
trieval. In CGIC, 2017. 2
[17] Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol
Vinyals, and George E. Dahl. Neural message passing for
quantum chemistry. ArXiv, 2017. 2
[18] Niv Haim, Nimrod Segol, Heli Ben-Hamu, Haggai Maron,
and Y. Lipman. Surface networks via general covers. 2019
IEEE/CVF International Conference on Computer Vision
(ICCV), 2019. 8
[19] Z. Han, M. Shang, Z. Liu, C. Vong, Y. Liu, M. Zwicker,
J. Han, and C. L. P. Chen. Seqviews2seqlabels: Learning
3d global features via aggregating sequential views by rnn
with attention. IEEE Transactions on Image Processing, (2),
2019. 2
[20] Rana Hanocka, Amir Hertz, Noa Fish, Raja Giryes, Shachar
Fleishman, and Daniel Cohen-Or. Meshcnn: a network with
an edge. ACM Trans. Graph., 2019. 1
[21] Vishakh Hegde and Reza Bosagh Zadeh. Fusionnet: 3d ob-
ject classification using multiple data representations. ArXiv,
2016. 8
[22] Mikael Henaff, Joan Bruna, and Yann LeCun. Deep con-
volutional networks on graph-structured data. CoRR, 2015.
2
[23] Jingwei Huang, Haotian Zhang, Li Yi, Thomas Funkhouser,
Matthias Nießner, and Leonidas J Guibas. Texturenet:
Consistent local parametrizations for learning from high-
resolution signals on meshes. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition,
2019. 2
[24] A. Kanezaki, Y. Matsushita, and Y. Nishida. Rotationnet:
Joint object categorization and pose estimation using mul-
tiviews from unsupervised viewpoints. In CVPR, 2018. 2,
8
[25] Michael M. Kazhdan, Thomas A. Funkhouser, and Szymon
Rusinkiewicz. Rotation invariant spherical harmonic repre-
sentation of 3d shape descriptors. In Symposium on Geome-
try Processing, 2003. 8
[26] Thomas Kipf and Max Welling. Semi-supervised classifica-
tion with graph convolutional networks. ArXiv, 2016. 2
[27] Thomas N. Kipf and Max Welling. Semi-supervised classi-
fication with graph convolutional networks. In International
Conference on Learning Representations (ICLR), 2017. 6
[28] Leif Kobbelt. √3-subdivision. In Proceedings of the 27th
Annual Conference on Computer Graphics and Interactive
Techniques, 2000. 5
[29] Sudhakar Kumawat and Shanmuganathan Raman. LP-
3DCNN: unveiling local phase in 3d convolutional neural
networks. In CVPR, 2019. 8
[30] Alon Lahav and A. Tal. Meshwalker: Deep mesh under-
standing by random walks. ACM Trans. Graph., 2020. 8
[31] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-
based learning applied to document recognition. Proceed-
ings of the IEEE, 1998. 6
[32] Ron Levie, Federico Monti, Xavier Bresson, and Michael M.
Bronstein. Cayleynets: Graph convolutional neural networks
with complex rational spectral filters. IEEE Transactions on
Signal Processing, 2019. 2
[33] Jiaxin Li, Ben M. Chen, and Gim Hee Lee. So-net: Self-
organizing network for point cloud analysis. CoRR, 2018.
8
[34] S. Li, Z. Luo, M. Zhen, Y. Yao, T. Shen, T. Fang, and L.
Quan. Cross-atlas convolution for parameterization invariant
learning on textured mesh surface. In CVPR, 2019. 8
[35] Yangyan Li, Rui Bu, Mingchao Sun, Wei Wu, Xinhan Di,
and Baoquan Chen. Pointcnn: Convolution on x-transformed
points. In Advances in Neural Information Processing Systems,
2018. 8
[36] Yangyan Li, Rui Bu, Mingchao Sun, Wei Wu, Xinhan Di,
and Baoquan Chen. Pointcnn: Convolution on x-transformed
points. In Advances in Neural Information Processing Sys-
tems, 2018. 6
[37] R. Litman and A. M. Bronstein. Learning spectral descrip-
tors for deformable shape correspondence. IEEE Transac-
tions on Pattern Analysis and Machine Intelligence, 2014. 2
[38] Yongcheng Liu, Bin Fan, Shiming Xiang, and Chunhong
Pan. Relation-shape convolutional neural network for point
cloud analysis. In CVPR, 2019. 8
[39] Charles Loop. Smooth subdivision surfaces based on tri-
angles. Master’s thesis, University of Utah, Department of
Mathematics, 1987. 5
[40] Jonathan Masci, Davide Boscaini, Michael M. Bronstein,
and Pierre Vandergheynst. Geodesic convolutional neural
networks on riemannian manifolds. In ICCVW, 2015. 1,
2
[41] D. Maturana and S. Scherer. Voxnet: A 3d convolutional
neural network for real-time object recognition. In IROS,
2015. 1,2,8
[42] A. Micheli. Neural network for graphs: A contextual con-
structive approach. IEEE Transactions on Neural Networks,
2009. 2
[43] F. Monti, D. Boscaini, J. Masci, E. Rodolà, J. Svoboda, and
M. M. Bronstein. Geometric deep learning on graphs and
manifolds using mixture model cnns. In CVPR, 2017. 1,2,
6,7
[44] J. Nagi, F. Ducatelle, G. A. Di Caro, D. Cireşan, U. Meier,
A. Giusti, F. Nagi, J. Schmidhuber, and L. M. Gambardella.
Max-pooling convolutional neural networks for vision-based
hand gesture recognition. In ICSIPA, 2011. 5
[45] Mathias Niepert, Mohamed Ahmed, and Konstantin
Kutzkov. Learning convolutional neural networks for graphs.
ArXiv, 2016. 2
[46] Frédéric Payan, Céline Roudet, and Basile Sauvage.
Semi-regular triangle remeshing: A comprehensive study. In
Computer Graphics Forum, 2015. 4
[47] Charles Ruizhongtai Qi, Hao Su, Kaichun Mo, and
Leonidas J. Guibas. Pointnet: Deep learning on point sets
for 3d classification and segmentation. CVPR, 2016. 1,2,8
[48] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J.
Guibas. Pointnet++: Deep hierarchical feature learning on
point sets in a metric space. In NIPS, 2017. 1,2,8
[49] Gernot Riegler, Ali O. Ulusoy, and Andreas Geiger. Octnet:
Learning deep 3d representations at high resolutions. CVPR,
2016. 2
[50] Konstantinos Sfikas, Theoharis Theoharis, and Ioannis
Pratikakis. Exploiting the PANORAMA Representation for
Convolutional Neural Network Classification and Retrieval.
In Eurographics Workshop on 3D Object Retrieval, 2017. 1,
2,8
[51] Yiru Shen, Chen Feng, Yaoqing Yang, and Dong Tian. Min-
ing point cloud local structures by kernel correlation and
graph pooling. In CVPR, 2018. 8
[52] B. Shi, S. Bai, Z. Zhou, and X. Bai. Deeppano: Deep
panoramic representation for 3-d shape recognition. IEEE
Signal Processing Letters, 2015. 1,2,8
[53] SideFX. Houdini. 6
[54] Ayan Sinha, Jing Bai, and Karthik Ramani. Deep learning
3d shape surfaces using geometry images. In ECCV, 2016.
8
[55] David Stutz and Andreas Geiger. Learning 3d shape comple-
tion under weak supervision. 2018. 4
[56] Hang Su, Subhransu Maji, Evangelos Kalogerakis, and
Erik G. Learned-Miller. Multi-view convolutional neural
networks for 3d shape recognition. In Proc. ICCV, 2015.
1,2,8
[57] Gusi Te, Wei Hu, Amin Zheng, and Zongming Guo. Rgcnn:
Regularized graph cnn for point cloud segmentation. In ACM
MM, 2018. 2
[58] Hugues Thomas, Charles R. Qi, Jean-Emmanuel Deschaud,
Beatriz Marcotegui, François Goulette, and Leonidas J.
Guibas. Kpconv: Flexible and deformable convolution for
point clouds. CoRR, 2019. 8
[59] Ian Walker and Ben Glocker. Graph convolutional Gaussian
processes. In Proceedings of the 36th International Confer-
ence on Machine Learning, 2019. 7
[60] Chu Wang, Babak Samari, and Kaleem Siddiqi. Local spec-
tral graph convolution for point set feature learning. In
ECCV, 2018. 2
[61] Peng-Shuai Wang, Yang Liu, Yu-Xiao Guo, Chun-Yu Sun,
and Xin Tong. O-cnn: Octree-based convolutional neural
networks for 3d shape analysis. ACM Trans. Graph., 2017.
2
[62] Yifan Xu, T. Fan, Mingye Xu, L. Zeng, and Y. Qiao. Spider-
cnn: Deep learning on point sets with parameterized convo-
lutional filters. ArXiv, 2018. 8
[63] Mohsen Yavartanoo, Euyoung Kim, and Kyoung Mu Lee.
Spnet: Deep 3d object classification and retrieval using stere-
ographic projection. In ACCV, 2018. 1,2,8
[64] Zhirong Wu, S. Song, A. Khosla, Fisher Yu, Linguang
Zhang, Xiaoou Tang, and J. Xiao. 3d shapenets: A deep
representation for volumetric shapes. In CVPR, 2015. 1,2,
6,8
Supplementary Material for
PolyNet: Polynomial Neural Network for 3D Shape Recognition with PolyShape
Representation
Mohsen Yavartanoo1, Shih-Hsuan Hung2, Reyhaneh Neshatavar1, Yue Zhang2, Kyoung Mu Lee1
1 SNU ECE & ASRI, 2 Oregon State University
{myavartanoo,reyhanehneshat,kyoungmu}@snu.ac.kr {hungsh,zhangyue}@oregonstate.edu
S. PolyNet analysis
In this section, we provide more mathematical and qual-
itative analysis of our proposed PolyNet with additional
explanations of PolyConv operation and the preprocessing
procedure of PolyShape and its results. First, we analyze
some equations of PolyConv mentioned in the main paper
for better understanding. Then we discuss PolyShape pre-
processing in detail and show its step-by-step results.
S.1. Expansion of PolyConv
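We derive the compact form of the polynomial function f(x, y) defined
in Eq. 5 in the main paper for d = 2 and d = 4 in Eq. S1 and Eq. S2,
respectively.

\begin{equation}
\begin{aligned}
X^{T} A X
&= \begin{bmatrix} 1 & x & y \end{bmatrix}
   \begin{bmatrix}
     A_{11} & A_{12} & A_{13} \\
     A_{21} & A_{22} & A_{23} \\
     A_{31} & A_{32} & A_{33}
   \end{bmatrix}
   \begin{bmatrix} 1 \\ x \\ y \end{bmatrix} \\
&= A_{11} + (A_{12} + A_{21})\,x + (A_{13} + A_{31})\,y \\
&\quad + A_{22}\,x^{2} + (A_{23} + A_{32})\,xy + A_{33}\,y^{2} \\
&= a_{0,0} + a_{1,0}\,x + a_{0,1}\,y + a_{2,0}\,x^{2} + a_{1,1}\,xy + a_{0,2}\,y^{2} \\
&= \sum_{0 \le i,j,\; i+j \le 2} a_{i,j}\, x^{i} y^{j} = f(x, y).
\end{aligned}
\tag{S1}
\end{equation}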
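\begin{equation}
\begin{aligned}
X^{T} A X
&= \begin{bmatrix} 1 & x & y & xy & x^{2} & y^{2} \end{bmatrix}
   \begin{bmatrix}
     A_{11} & A_{12} & A_{13} & A_{14} & A_{15} & A_{16} \\
     A_{21} & A_{22} & A_{23} & A_{24} & A_{25} & A_{26} \\
     A_{31} & A_{32} & A_{33} & A_{34} & A_{35} & A_{36} \\
     A_{41} & A_{42} & A_{43} & A_{44} & A_{45} & A_{46} \\
     A_{51} & A_{52} & A_{53} & A_{54} & A_{55} & A_{56} \\
     A_{61} & A_{62} & A_{63} & A_{64} & A_{65} & A_{66}
   \end{bmatrix}
   \begin{bmatrix} 1 \\ x \\ y \\ xy \\ x^{2} \\ y^{2} \end{bmatrix} \\
&= A_{11} \\
&\quad + (A_{12} + A_{21})\,x + (A_{13} + A_{31})\,y \\
&\quad + (A_{22} + A_{15} + A_{51})\,x^{2} + (A_{23} + A_{32} + A_{14} + A_{41})\,xy
         + (A_{33} + A_{16} + A_{61})\,y^{2} \\
&\quad + (A_{25} + A_{52})\,x^{3} + (A_{35} + A_{53} + A_{24} + A_{42})\,x^{2}y \\
&\quad + (A_{26} + A_{62} + A_{34} + A_{43})\,xy^{2} + (A_{36} + A_{63})\,y^{3} \\
&\quad + A_{55}\,x^{4} + (A_{45} + A_{54})\,x^{3}y + (A_{44} + A_{56} + A_{65})\,x^{2}y^{2}
         + (A_{46} + A_{64})\,xy^{3} + A_{66}\,y^{4} \\
&= a_{0,0} + a_{1,0}\,x + a_{0,1}\,y \\
&\quad + a_{2,0}\,x^{2} + a_{1,1}\,xy + a_{0,2}\,y^{2} \\
&\quad + a_{3,0}\,x^{3} + a_{2,1}\,x^{2}y + a_{1,2}\,xy^{2} + a_{0,3}\,y^{3} \\
&\quad + a_{4,0}\,x^{4} + a_{3,1}\,x^{3}y + a_{2,2}\,x^{2}y^{2} + a_{1,3}\,xy^{3} + a_{0,4}\,y^{4} \\
&= \sum_{0 \le i,j,\; i+j \le 4} a_{i,j}\, x^{i} y^{j} = f(x, y).
\end{aligned}
\tag{S2}
\end{equation}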
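The grouping of the entries of A into the coefficients a_{i,j} can be checked numerically with a small sketch. It assumes that X is a vector of monomials, each stored as an exponent pair (i, j), and that for d = 4 the basis is ordered as (1, x, y, xy, x^2, y^2), which is the ordering implied by the index pattern of the expanded terms in Eq. S2. The helper names and the numeric entries of A are hypothetical.

```python
from collections import defaultdict


def collect_coefficients(A, basis):
    """Sum the entries A[p][q] that multiply the same monomial x^i y^j."""
    coeffs = defaultdict(float)
    for p, (ip, jp) in enumerate(basis):
        for q, (iq, jq) in enumerate(basis):
            coeffs[(ip + iq, jp + jq)] += A[p][q]
    return dict(coeffs)


def quadratic_form(A, basis, x, y):
    """Evaluate X^T A X with X built from the monomial basis."""
    X = [x**i * y**j for i, j in basis]
    n = len(X)
    return sum(A[p][q] * X[p] * X[q] for p in range(n) for q in range(n))


def polynomial(coeffs, x, y):
    """Evaluate f(x, y) = sum of a_{i,j} x^i y^j."""
    return sum(a * x**i * y**j for (i, j), a in coeffs.items())


# d = 2: X = (1, x, y), with an arbitrary (assumed) 3x3 matrix A.
basis_d2 = [(0, 0), (1, 0), (0, 1)]
A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
coeffs_d2 = collect_coefficients(A, basis_d2)
# a_{1,0} = A12 + A21 and a_{1,1} = A23 + A32, as in Eq. S1.
print(coeffs_d2[(1, 0)], coeffs_d2[(1, 1)])
print(quadratic_form(A, basis_d2, 2.0, 3.0), polynomial(coeffs_d2, 2.0, 3.0))

# d = 4: X = (1, x, y, xy, x^2, y^2), with an all-ones 6x6 matrix so that
# each coefficient equals the number of entries of A grouped into it.
basis_d4 = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 0), (0, 2)]
A6 = [[1.0] * 6 for _ in range(6)]
coeffs_d4 = collect_coefficients(A6, basis_d4)
print(len(coeffs_d4), coeffs_d4[(2, 2)])
```

With the all-ones matrix, the coefficient of x^2 y^2 is 3, matching the three entries A44, A56, and A65 grouped in Eq. S2, and the 36 entries of A collapse into the 15 monomials of a bivariate polynomial of degree 4.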
Figure S1: PolyShape processing in Houdini. Each box refers to a node
with the same name in the software. [Flowchart. Nodes: matchsize, clean,
polyreduce, ray, normal, subdivide (PTQ), tridivide (√3-subdivision).
Outputs: Coarse, PTQ1 to PTQ3, Sqrt1 to Sqrt3.]

Note that we parameterize the matrix A by the matrix B and learn the
matrix B instead of the matrix A. The parameterized matrix A for the
polynomial functions of degree d = 2 is given in Eq. S3:
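\begin{equation}
\begin{aligned}
A = B B^{T}
&= \begin{bmatrix}
     A_{11} & A_{12} & A_{13} \\
     A_{21} & A_{22} & A_{23} \\
     A_{31} & A_{32} & A_{33}
   \end{bmatrix}
 = \begin{bmatrix}
     B_{11} & B_{12} & B_{13} \\
     B_{12} & B_{22} & B_{23} \\
     B_{13} & B_{23} & B_{33}
   \end{bmatrix}
   \begin{bmatrix}
     B_{11} & B_{12} & B_{13} \\
     B_{12} & B_{22} & B_{23} \\
     B_{13} & B_{23} & B_{33}
   \end{bmatrix}, \\
A_{11} &= B_{11}^{2} + B_{12}^{2} + B_{13}^{2}, \\
A_{12} &= B_{11}B_{12} + B_{12}B_{22} + B_{13}B_{23}, \\
A_{13} &= B_{11}B_{13} + B_{12}B_{23} + B_{13}B_{33}, \\
A_{21} &= B_{12}B_{11} + B_{22}B_{12} + B_{23}B_{13}, \\
A_{22} &= B_{12}^{2} + B_{22}^{2} + B_{23}^{2}, \\
A_{23} &= B_{12}B_{13} + B_{22}B_{23} + B_{23}B_{33}, \\
A_{31} &= B_{13}B_{11} + B_{23}B_{12} + B_{33}B_{13}, \\
A_{32} &= B_{13}B_{12} + B_{23}B_{22} + B_{33}B_{23}, \\
A_{33} &= B_{13}^{2} + B_{23}^{2} + B_{33}^{2},
\end{aligned}
\tag{S3}
\end{equation}
where B is a learnable symmetric matrix.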
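The parameterization in Eq. S3 can be reproduced with a short sketch. It builds a symmetric B from its six free entries, forms A = B B^T, and evaluates f(x, y) = X^T A X for X = (1, x, y). A direct consequence of this form, noted here as an observation rather than a statement from the paper, is that A is symmetric and f(x, y) equals the squared norm of B^T X, so it is non-negative. The helper names and the numeric entries of B are hypothetical.

```python
def symmetric_3x3(b11, b12, b13, b22, b23, b33):
    """Build the symmetric matrix B of Eq. S3 from its six free entries."""
    return [[b11, b12, b13], [b12, b22, b23], [b13, b23, b33]]


def transpose(M):
    return [list(row) for row in zip(*M)]


def matmul(P, Q):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]


def f(A, x, y):
    """Evaluate the degree-2 polynomial f(x, y) = X^T A X with X = (1, x, y)."""
    X = [1, x, y]
    return sum(A[i][j] * X[i] * X[j] for i in range(3) for j in range(3))


B = symmetric_3x3(1, 2, 3, -1, 0, 2)  # assumed example values
A = matmul(B, transpose(B))
print(A)  # entries follow Eq. S3, e.g. A12 = B11*B12 + B12*B22 + B13*B23

# f(x, y) equals the squared norm of B^T X, hence it cannot be negative.
X = [1, -1, 1]
BtX = [sum(B[k][i] * X[k] for k in range(3)) for i in range(3)]
print(f(A, -1, 1), sum(v * v for v in BtX))
```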
S.2. Details of PolyShape Processing
We provide the flowchart of the PolyShape processing in the Houdini
software in Figure S1. Given a 3D CAD model and its cleaned model
obtained by mesh fusion, we first resize the models and remove the
unused points with the matchsize and clean nodes. Next, we generate the
coarse mesh by reducing the number of vertices to 400 with polyreduce
and fitting the shape to the 3D CAD model with ray. We create the
multi-resolution structure of the PolyShape by subdividing the coarse
mesh three times with either primal triangle quadrisection (PTQ,
subdivide) or √3-subdivision (tridivide). At each iteration of these
subdivisions, we fit the generated mesh to the given 3D CAD model to
maintain the details of the original shape. Figure S2 and Figure S3
show the resulting PolyShapes of √3-subdivision and PTQ, respectively,
for the same input models. The PolyShapes generated by √3-subdivision
have fewer faces than the shapes created by PTQ at the same subdivision
level, since each √3-subdivision step triples the number of faces while
each PTQ step quadruples it. At the highest resolution of the
PolyShapes, √3-subdivision produces about 0.43 times as many faces as
PTQ on average. Therefore, √3-subdivision provides a more efficient
representation for storage and for the computation of the
classification.
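The growth of the two subdivision schemes can be sketched by counting vertices and faces per level. The sketch assumes a closed manifold triangle mesh, so that the number of edges is E = 3F/2, and it ignores the shape-fitting step, which moves vertices but does not change the counts. The starting values V = 400 and F = 796 are those of a typical coarse mesh; the function names are hypothetical.

```python
def sqrt3_subdivide(v, f):
    """One sqrt(3)-subdivision step: one new vertex per face, faces tripled."""
    return v + f, 3 * f


def ptq_subdivide(v, f):
    """One PTQ step: one new vertex per edge, faces quadrupled."""
    edges = 3 * f // 2  # closed triangle mesh: every edge is shared by two faces
    return v + edges, 4 * f


def levels(step, v, f, n=3):
    """Return the (V, F) counts of the coarse mesh and of n subdivision levels."""
    counts = [(v, f)]
    for _ in range(n):
        v, f = step(v, f)
        counts.append((v, f))
    return counts


coarse = (400, 796)
sqrt3 = levels(sqrt3_subdivide, *coarse)
ptq = levels(ptq_subdivide, *coarse)
ratio = sqrt3[-1][1] / ptq[-1][1]
print(sqrt3)
print(ptq)
print(ratio)  # (3/4)^3 = 27/64 after three levels
```

Under these assumptions the face ratio after three levels is 27/64, about 0.42, which is in line with the average factor of about 0.43 reported above.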
Figure S2: PolyShape representation. PolyShape processing results on
some samples of the ModelNet-10 dataset based on √3-subdivision. Sqrt1
to Sqrt3 refer to the output of the PolyShape procedure after each
level of subdivision. Note that V and F refer to the number of vertices
and faces for each shape. [Columns, left to right: 3D CAD, Cleaned,
Coarse, Sqrt1, Sqrt2, Sqrt3. For example, one sample has V = 584,
F = 676 (3D CAD); V = 2502, F = 5000 (Cleaned); V = 400, F = 796
(Coarse); V = 1196, F = 2388 (Sqrt1); V = 3584, F = 7164 (Sqrt2); and
V = 10748, F = 21492 (Sqrt3).]
Figure S3: PolyShape representation. PolyShape processing results on
some samples of the ModelNet-10 dataset based on PTQ. PTQ1 to PTQ3
refer to the output of the PolyShape procedure after each level of
subdivision. Note that V and F refer to the number of vertices and
faces for each shape. [Columns, left to right: 3D CAD, Cleaned, Coarse,
PTQ1, PTQ2, PTQ3. For example, one sample has V = 584, F = 676
(3D CAD); V = 2502, F = 5000 (Cleaned); V = 400, F = 796 (Coarse);
V = 1594, F = 3184 (PTQ1); V = 6370, F = 12736 (PTQ2); and V = 25474,
F = 50944 (PTQ3).]